Question

We recently had a meltdown of a Tomcat server that produced an 82.7 GB "catalina.out" log file, which I saved for forensic analysis.

What macOS editors can open monster text files without consuming 80 GB of RAM or causing 15-minute freezes?

Solution

Try glogg. There is a macOS build on the download page:

https://glogg.bonnefon.org/download.html

I don't know about 80 GB files, but I regularly used it (on Windows) to open log files up to 5 GB, and it works great on those (memory footprint after indexing is about 100-150 MB, and searching is very fast).

One note though: it's a read-only analyzer, not an editor.

OTHER TIPS

less filename

From the command line, it lets you view files straightaway without loading the full file into memory.

I would not try to open it... I'd rather do:

  1. grep - look for some text
  2. split - chop the file into, say, 10 MB chunks.

Something like:

grep "crash" My80GbFile.txt | more 

If the big file is not line-delimited:

split -b 10M My80GbFile.txt

But if the big file is just a load of lines, then (as was posted) split it by line, 100,000 lines per sub-file in this case:

 split -l 100000 My80GbFile.txt
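If you do split it, giving the chunks a recognizable prefix makes them easier to grep through and clean up afterwards. A possible variant (the catalina_part_ prefix and the one-million-line chunk size are just example choices):

 split -a 3 -l 1000000 My80GbFile.txt catalina_part_
 grep -n "crash" catalina_part_*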

In terms of your immediate needs, the best free visual editor for macOS is BBEdit (available from the Mac App Store), and it does so much - a true powerhouse. Once you have it, you can also pay up for the pro / automation features (or simply out of gratitude), but it's free forever if you want it and like that price.

I also use vi to edit things, but that opens a can of worms: you need a shell (Terminal or another terminal app) and some studying to learn how to exit the editor (tl;dr: try ZZ or ZQ), customize it, and train your brain to think about operating on text in the abstract rather than selecting things with the mouse. A pager like less, more, or bat is also very friendly for getting started and navigating around massive files. (And bat gives you wings, er, awesome colors and syntax awareness.)

brew install bat
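For a quick look with paging (so bat pipes through a pager instead of dumping the whole file to the terminal at once), something like this should do; the path is just a placeholder:

 bat --paging=always --style=plain /path/to/catalina.out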

In your case, the Console app that comes with macOS might also be worth looking at if its search functionality is enough for you. Launch the app from Spotlight and drag your monster file onto the window to have a peek.

Just don't (open it as ONE file)

Is there any specific reason you cannot simply break it into roughly 1 GB chunks with a script?

Yes, searching and similar functionality will suffer, but that is already the case with an 80 GB file.

If you have specific break points in the log (days in the timestamps, startup / shutdown messages), you could also split it at those. That way you would probably even get additional meaning out of the pieces.
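As a rough sketch of that idea (assuming each line starts with an ISO-style date such as 2020-02-17 and that the log is chronological), a one-pass awk split by day could look like this:

 awk '{
   day = substr($0, 1, 10)              # e.g. "2020-02-17"
   if (day != prev) {                   # new day: switch output files
     if (prev != "") close(out)
     out = "catalina." day ".log"
     prev = day
   }
   print > out
 }' catalina.out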

Also: once it is split up, any decent IDE (like IntelliJ IDEA or any other) will give you back search functionality over the text.

[Beware: this comes from a programmer, so it might not be your approach, or it may be overkill. I can only say that it WOULD work in the end; you'll have to decide whether it's worth it.]

  1. Use less in a terminal window. It will show you one page of the file at a time and will only load about that much into memory, so you can navigate multi-TB files with it if you want.

    You probably should add the -n option to prevent less from trying to compute line numbers. So:

    less -n /path/to/file
    

    Remember you can type less -n (don't forget the final space) and drag-and-drop the file from the Finder to the Terminal window to add the path to that file.

  2. Once you are viewing the file in less, you can:

    • navigate using up/down arrows, space (one page down), b (one page back)...
    • search using /. You can also search for lines not containing a pattern with /!. Reverse search uses ?. But all searches will scan the whole file. Better have it on an SSD if you do that a lot.
    • navigate to a specific line in the file using <number> followed by G (capital G)
    • navigate to a specific part of the file using <number> followed by %. So 50% will get you to the middle of the file, 90% to the last 10%, etc.

If your log file has timestamps and you know when you want to look, the quickest approach is to:

  1. open the file
  2. Use a "binary search" to find the rough part of the file you are interested in:

    • Type 50%, which will show you the middle of the file
    • If the part you want is after, go to 75%, otherwise 25%
    • Repeat until you have narrowed down to the relevant part
  3. Use a regular search (using / to go forward or ? to go backwards) to find the exact line you're looking for (based on either the exact timestamp, or a specific word you know shows the issue).

This should allow you to navigate quickly to the relevant part of the file.
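less can also run such commands for you at startup via + on the command line, if you already know roughly where to start (the 50% and the search pattern below are only examples):

 less -n +50p /path/to/file               # open at roughly the middle of the file
 less -n '+/2020-02-17 12:' /path/to/file # open at the first line matching the pattern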


If you think you'll have a lot of searching within a subset of the file, you could alternatively use grep with a specific date or date-time combination (in the right format) to first extract that subset to another smaller file. For instance, if you know the crash occurred today a bit after noon while your log covers months, you could

grep '2020-02-17 12:' /path/to/file > extracted-log.txt

This would give you all lines which contain a timestamp between 12:00:00 and 12:59:59 inclusive. Of course, the exact format will depend on the actual format used for timestamps.
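If you need a narrower window than a full hour, a slightly more specific pattern works the same way; for instance (assuming the same timestamp format, and picking 12:10-12:19 purely as an illustration):

 grep -E '2020-02-17 12:1[0-9]:' /path/to/file > extracted-log.txt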

grep will scan the whole file once to find all the relevant lines, which will take a little while on a very large file, but you'll then have a much more manageable file.


An alternative may be to use dd to "extract" a part of the original file, using offsets and lengths found in less (Ctrl-G to get the current offset). dd is a very powerful tool but can be very dangerous to use, so use with caution (and most definitely not as root or with sudo if you are not 100% sure of what you're doing):

dd if=/path/to/original/file of=destination_file.txt bs=1 skip=<start offset> count=<length>

Note that this is not very efficient; it's better to use a larger block size (bs), ideally a power of 2 such as 1024, and divide skip and count by that block size.
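For instance (with made-up offsets), to pull 100 MiB out of the file starting 40 GiB in, you could use 1 MiB blocks and scale skip and count accordingly:

 # 40 GiB = 40960 blocks of 1 MiB; 100 MiB = 100 blocks
 dd if=/path/to/original/file of=destination_file.txt bs=1048576 skip=40960 count=100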

I'm pretty sure there must be other tools that do the same, though I'm drawing a blank. I think some versions of cat can do it, but not the one on macOS apparently.
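One combination that should also work (assuming the BSD tail and head shipped with macOS, both of which accept byte counts) is tail piped into head, using the same placeholders as the dd example above:

 # tail's +offset is 1-based (the first byte of the file is +1)
 tail -c +<start offset> /path/to/original/file | head -c <length> > destination_file.txt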

With disk-based text editors, the file is not loaded entirely into memory - what you see in the UI is a peek into the portion the editor has currently loaded. I have used UltraEdit successfully in the past for large log file analysis. Its regex-based search tools and location bookmarks are especially useful. It loads the file snappily, and you can run regular-expression searches. The UltraEdit site has a download page where you can get a 30-day trial version. There are other disk-based text editors as well.

Since it has been a few years, I installed UltraEdit and opened the largest file I had: a 64 GB binary file. It opened instantly, and a search for a term took about 90 seconds. I have highlighted the file size with a red rectangle in the bottom right. The Mac is a 2018 MBP with 8 GB of RAM running Mojave.

[Screenshot of UltraEdit with a 64 GB file open and the search window open]

You wouldn't

Even a Tolkien fan doesn't want 82.7 GB of anything. You only want certain bits out of it; you'll know it when you see it.

And even contemplating a tool that analyzes the whole file is a waste of time, literally: it's going to spend about 15 minutes just reading through the file, assuming 100 MB/s. A lot slower if it's doing analysis of any complexity.

Terminal is your friend

The lifesaver here is that OS X is built on top of Unix. That was a big part of Apple buying NeXT and Steve Jobs coming back. That means you can use the entire suite of Unix tools, which are extremely well-honed, and very well supported here.

There are dozens of ways to do it without perl, but since perl is built into macOS and is infinitely extensible, I prefer to start there (rather than do it in a simpler tool, then want to improve the query somewhat, hit the limits of that tool, and have to re-craft it in a different one). So, something like this in a file called, say, "xx":

 use strict;
 use warnings;

 my $len = -s "filename.log";                 # length of the file in bytes
 open (my $IN, "<", "filename.log") or die "can't open log: $!";
 seek ($IN, $len - 10_000_000, 0);            # perl allows _ in numbers for readability
                                              # (the first line read will likely be partial)

 while (<$IN>) {         # <> reads a line.  Default variable is metavariable $_
   print;                # with no argument, print defaults to $_
 }

That won't read the whole file; it just seeks to the specified location (10 MB from the end), then reads and prints everything to the end. It will just print it to the screen, so to send it to a file, do this when you call it:

 perl xx > tailfile.txt

Now you have a 10MB tailfile.txt that you can open with something else.

There are simpler ways to do just that, but suppose you realize "Wait, I want to do more. I only want errors and warnings." So you change the print command to

 print if /error/i or /warning/i;    # // matches text, defaults to $_ 

That too can be accomplished in simpler tools if you spend enough time rooting through docs. But then you decide you need to see the three lines after each error. Just like that... you've outgrown the simpler tools, but this is trivial in Perl. You can just keep shimming Perl pretty much forever. There's a full programming language in there. Object-oriented and everything.

A file that large is probably 99.999999% redundant (literally), so the key is to remove lines that occur a zillion times, to some degree of similarity, and examine what's left over.

On Linux there's a utility called petit, designed for analyzing huge log files, that does this. An example usage is petit --hash /var/log/kern.log. The utility can probably be found or built for Mac.

It processes each line of the file to remove things that make the line unique; for example, strip the date from each line, and substitute all strings of digits with a single # character. Each generic line is then hashed to become a fingerprint for detection of similar lines.

The result is that it outputs each line only once with a count of occurrences, vastly reducing the size of the data. Anything out of the ordinary is likely to show up clearly, and then one can search for that specifically, using utilities from some of the other answers here.

I don't know if this particular utility is performant enough for something that size. I would bet yes, because it has options for plotting graphs on the order of months or years of input, and wouldn't need to store much besides a small number of fingerprints. In the worst case you could write your own: for each input line, genericize it to a fingerprint, hash it, and add it to a database of hash+fingerprint+count, indexed by hash.
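A very rough shell approximation of that approach (no hashing, just sort and uniq on genericized lines; the sed patterns are guesses at a typical timestamp format and will need adjusting, and sorting an 80 GB input needs plenty of temp disk space and time):

 sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[ T][0-9:,.]+//; s/[0-9]+/#/g' catalina.out \
   | sort | uniq -c | sort -rn | head -50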

EDIT: petit seems to use more CPU and memory than desired, so I wrote my own simple implementation: https://github.com/curtmcd/hashlog. It makes one pass through the log file; it processes at about 6.5 sec/GB on my home Ubuntu server.

"joe", aka Joe's Own Editor, was designed to only load parts of the file as needed. I've never used it on a file that large but I never came across a text.file too large for it to open.

Definitely Hex Fiend. It opens files without loading them into RAM; it simply reads from the disk. Performance is absolutely incredible. I've examined 500 GB password dumps with it before.

https://ridiculousfish.com/hexfiend/

Open Terminal and use vim to open it:

vim filename.txt

P.S.:

Type vim (with a trailing space), drag the file onto your Terminal window, then hit Enter.

To quit vim (without editing):

:q!

I would recommend Sublime Text. Although it requires a license, it can be downloaded and evaluated for free without time or functionality limitations, which means you or your company can try it out as much and however you want. I personally use it for investigating logs of maybe 3-4 GB in most cases, or SQL dumps of up to 12 GB. On initial opening it does go through the entire file to perform first-level indexing and so on, but it shows a progress bar for the whole process.

Licensed under: CC-BY-SA with attribution
Not affiliated with apple.stackexchange