Estimate the number of USN records on an NTFS volume

https://stackoverflow.com/questions/11336390

19-06-2021

Question

When the USN journal is used for the first time, the volume's entire set of USN records must be enumerated using the FSCTL_ENUM_USN_DATA control code. This is usually a lengthy operation.

Is there a way to estimate the number of records on the volume prior to running it, so progress can be displayed?

I'm guessing the USN data for the entire volume is generated from the MFT, with one record per file (approximately). So perhaps a way to estimate the number of active files in the MFT would work.

Solution

You can use FSCTL_GET_NTFS_VOLUME_DATA to get the length in bytes of the MFT. If you compare this to the number of records on a selection of representative volumes, you could estimate the average length of a single MFT record and use this to calculate an estimate for the number of records on a particular volume.

Because the MFT contains (for example) the security information for every file, the average length will vary significantly from volume to volume, so I think you'll only get order-of-magnitude accuracy, but it may be good enough in most cases.
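As a rough illustration of that calibration idea, here is a minimal sketch. The class and method names are hypothetical, and the average record length is something you would have to measure yourself on sample volumes; nothing here comes from the Windows API.

```csharp
using System;

public static class MftRecordEstimator
{
    // Calibration step: on a sample volume, divide the MFT length (from
    // FSCTL_GET_NTFS_VOLUME_DATA) by the number of records a full
    // FSCTL_ENUM_USN_DATA pass actually returned.
    public static double AverageBytesPerRecord(long mftValidDataLength, long recordsEnumerated)
    {
        if (recordsEnumerated <= 0)
            throw new ArgumentOutOfRangeException(nameof(recordsEnumerated));
        return (double)mftValidDataLength / recordsEnumerated;
    }

    // Estimation step: apply the measured average to another volume's MFT length.
    // Expect order-of-magnitude accuracy only, as noted above.
    public static long EstimateRecordCount(long mftValidDataLength, double averageBytesPerRecord)
    {
        if (mftValidDataLength < 0)
            throw new ArgumentOutOfRangeException(nameof(mftValidDataLength));
        if (!(averageBytesPerRecord > 0))
            throw new ArgumentOutOfRangeException(nameof(averageBytesPerRecord));
        return (long)Math.Ceiling(mftValidDataLength / averageBytesPerRecord);
    }
}
```

Averaging the calibration figure over several volumes, rather than relying on one, should smooth out some of the volume-to-volume variation.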

Another approach would be to assume that file reference numbers increase linearly, which is roughly true. You can use FSCTL_ENUM_USN_DATA to find out whether any files have a reference number above a particular guess; you'd need no more than 128 guesses to determine the actual maximum reference number. That would at least give you a percentage complete between 0 and 100 at any given point. It wouldn't be entirely uniform, but then progress bars never are. :-)
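The guessing strategy can be sketched as a plain bisection. The probe is passed in as a delegate, hasRecordAtOrAbove, which is a hypothetical wrapper you would write around FSCTL_ENUM_USN_DATA (start the enumeration at the given number and report whether anything came back). The exclusiveLimit parameter is also an assumption: a value you are sure lies above every valid number.

```csharp
using System;

public static class MaxReferenceSearch
{
    // Returns the highest start value for which the probe still finds a record,
    // or null if the probe finds nothing even when starting from zero.
    // Relies on the probe being monotonic: if records exist at or above x,
    // they also exist at or above any smaller value.
    public static ulong? FindHighest(Func<ulong, bool> hasRecordAtOrAbove, ulong exclusiveLimit)
    {
        if (hasRecordAtOrAbove == null)
            throw new ArgumentNullException(nameof(hasRecordAtOrAbove));
        if (exclusiveLimit == 0 || !hasRecordAtOrAbove(0))
            return null;

        ulong lo = 0;               // invariant: probe(lo) is true
        ulong hi = exclusiveLimit;  // invariant: probe(hi) is assumed false
        while (hi - lo > 1)
        {
            ulong mid = lo + (hi - lo) / 2;
            if (hasRecordAtOrAbove(mid))
                lo = mid;
            else
                hi = mid;
        }
        return lo;
    }
}
```

Each probe halves the remaining range, so a limit of 2^48 needs 48 probes after the initial check, comfortably within the bound mentioned above.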

Additional:

Looking more closely, on Windows 7 x64 the "next id" field returned by FSCTL_ENUM_USN_DATA (the quadword returned before the first USN_RECORD structure) isn't a file reference number after all, but the file record segment number. So, as you observed, the last id number returned, multiplied by BytesPerFileRecordSegment (1024), is equal to MftValidDataLength.

File reference numbers appear to be made up of two parts. The low six bytes contain the file record segment number. The first record returned from each request always has an FRN whose segment number is the same as the "next id" fed into StartFileReferenceNumber, except for the first call when StartFileReferenceNumber is zero. The upper two bytes contain additional information, which is never zero (NTFS documentation describes these 16 bits as a sequence number that is incremented whenever the record segment is reused).
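For illustration, splitting a 64-bit file reference number along those lines could look like the sketch below. The class and method names are made up for this example, and calling the upper 16 bits a sequence number follows the NTFS documentation rather than anything verified in this answer.

```csharp
public static class FileReference
{
    // Low 48 bits (six bytes): file record segment number.
    private const ulong SegmentMask = 0x0000FFFFFFFFFFFFUL;

    public static ulong SegmentNumber(ulong fileReferenceNumber)
        => fileReferenceNumber & SegmentMask;

    // Upper 16 bits (two bytes): assumed here to be the sequence number.
    public static ushort SequenceNumber(ulong fileReferenceNumber)
        => (ushort)(fileReferenceNumber >> 48);

    // Rebuilds a full reference number from its two parts.
    public static ulong Combine(ulong segmentNumber, ushort sequenceNumber)
        => ((ulong)sequenceNumber << 48) | (segmentNumber & SegmentMask);
}
```

For example, FileReference.SegmentNumber(0x0005000000000123) yields 0x123 and FileReference.SequenceNumber of the same value yields 5.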

It seems that FSCTL_ENUM_USN_DATA accepts either a file record segment number (in which case the top two bytes are zero) or a file reference number (in which case the top two bytes are nonzero).

One oddity is that I can't find two records with the same record segment number. This suggests that each file record is using at least 1K in the MFT, which doesn't seem reasonable.

Anyway, the upshot is that it is probably sensible to multiply the "next id" by BytesPerFileRecordSegment and divide it by MftValidDataLength to get the fraction completed (multiply by 100 for a percentage), so long as you cope gracefully if this returns a nonsensical result.
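A minimal sketch of that progress calculation follows, with clamping as one way of coping with nonsensical results. The names are hypothetical; masking off the upper 16 bits is an added precaution in case a full file reference number is passed in rather than a bare segment number.

```csharp
using System;

public static class EnumProgress
{
    // nextId: the quadword returned at the start of each FSCTL_ENUM_USN_DATA buffer.
    // The other two arguments come from FSCTL_GET_NTFS_VOLUME_DATA.
    public static double PercentComplete(ulong nextId, uint bytesPerFileRecordSegment, long mftValidDataLength)
    {
        // Degenerate inputs: report no progress rather than divide by zero.
        if (mftValidDataLength <= 0 || bytesPerFileRecordSegment == 0)
            return 0.0;

        // Keep only the low 48 bits (the file record segment number).
        ulong segment = nextId & 0x0000FFFFFFFFFFFFUL;

        double percent = 100.0 * ((double)segment * bytesPerFileRecordSegment) / mftValidDataLength;

        // Clamp so an unexpected id can never push the progress bar out of range.
        return Math.Min(100.0, Math.Max(0.0, percent));
    }
}
```

With a 1 MiB MFT and 1,024-byte record segments, a next id of 512 gives 50 percent.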

Other tips

In fact, the MftValidDataLength field of the NTFS_VOLUME_DATA_BUFFER / NTFS_EXTENDED_VOLUME_DATA structures places an upper limit on the number of USN records that FSCTL_ENUM_USN_DATA will return (assuming no additional records are added between the time you compute the estimate and the time you run the enumeration).

In the C# example below, I divide the vd.MftValidDataLength value by vd.BytesPerFileRecordSegment, being sure to round up by first adding divisor - 1 to the dividend before dividing. As for the divisor, I believe its value is 1,024 on most systems, in case you prefer to hard-code it, although reading it from the structure is safer because some volumes (for example, those on 4K-native disks) report 4,096.

[Serializable, StructLayout(LayoutKind.Sequential)]
public struct NTFS_EXTENDED_VOLUME_DATA
{
    public VOLUME_ID     /**/ VolumeSerialNumber;
    public long          /**/ NumberSectors;
    public long          /**/ TotalClusters;
    public long          /**/ FreeClusters;
    public long          /**/ TotalReserved;
    public uint          /**/ BytesPerSector;
    public uint          /**/ BytesPerCluster;
    public int           /**/ BytesPerFileRecordSegment;   // <--
    public uint          /**/ ClustersPerFileRecordSegment;
    public long          /**/ MftValidDataLength;          // <--
    public long          /**/ MftStartLcn;
    public long          /**/ Mft2StartLcn;
    public long          /**/ MftZoneStart;
    public long          /**/ MftZoneEnd;
    public uint          /**/ ByteCount;
    public ushort        /**/ MajorVersion;
    public ushort        /**/ MinorVersion;
    public uint          /**/ BytesPerPhysicalSector;
    public ushort        /**/ LfsMajorVersion;
    public ushort        /**/ LfsMinorVersion;
    public uint          /**/ MaxDeviceTrimExtentCount;
    public uint          /**/ MaxDeviceTrimByteCount;
    public uint          /**/ MaxVolumeTrimExtentCount;
    public uint          /**/ MaxVolumeTrimByteCount;
};

Typical constants, abridged for clarity:

public enum FSCTL : uint
{
    // etc...     etc...
    FILESYSTEM_GET_STATISTICS   /**/ = (9 << 16) | 0x0060,
    GET_NTFS_VOLUME_DATA        /**/ = (9 << 16) | 0x0064,  // <--
    GET_NTFS_FILE_RECORD        /**/ = (9 << 16) | 0x0068,
    GET_VOLUME_BITMAP           /**/ = (9 << 16) | 0x006f,
    GET_RETRIEVAL_POINTERS      /**/ = (9 << 16) | 0x0073,
    // etc...     etc...
    ENUM_USN_DATA               /**/ = (9 << 16) | 0x00b3,
    READ_USN_JOURNAL            /**/ = (9 << 16) | 0x00bb,
    // etc...     etc...
    CREATE_USN_JOURNAL          /**/ = (9 << 16) | 0x00e7,
    // etc...     etc...
};

Pseudo-code follows, since everyone has their own favorite ways of doing P/Invoke...

// etc..

// DeviceIoControl here stands for your own typed P/Invoke wrapper, not the raw Win32 signature.
if (!DeviceIoControl(h_vol, FSCTL.GET_NTFS_VOLUME_DATA, out NTFS_EXTENDED_VOLUME_DATA vd))
    throw new Win32Exception(Marshal.GetLastWin32Error());

var c_mft_estimate = (vd.MftValidDataLength + (vd.BytesPerFileRecordSegment - 1))
                                                        / vd.BytesPerFileRecordSegment;

Great, so what can you do with this value? Unfortunately, knowing this maximum cap on the number of USN records that FSCTL_ENUM_USN_DATA will return doesn't help with choosing a buffer size for the DeviceIoControl/FSCTL_ENUM_USN_DATA call itself, since the USN_RECORD structures returned in each iteration vary in size according to the length of the reported filenames.

It is true that if you provide a buffer large enough for all of the USN_RECORD structures, DeviceIoControl will dutifully return them all in a single call, which avoids an iterative calling loop and simplifies the code considerably. But the little calculation above doesn't give any principled estimate of that buffer size, unless you're willing to settle for using it towards some kind of gross overestimation.
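To give a sense of how gross such an overestimate would be, here is a hedged sketch of a worst-case bound. It assumes USN_RECORD_V2 records (60 bytes of fixed fields before the file name), file names of at most 255 UTF-16 code units, and 8-byte alignment of each record. The class name and constants are illustrative, not taken from any header.

```csharp
public static class UsnBufferBound
{
    public const int NextIdBytes = 8;             // quadword that precedes the first record
    public const int FixedHeaderBytes = 60;       // assumed USN_RECORD_V2 fields before FileName
    public const int MaxFileNameBytes = 255 * 2;  // assumed 255 UTF-16 code units at most

    // Rounds up to the next multiple of 8.
    private static int Align8(int n) => (n + 7) & ~7;

    // Largest size a single record could have under the assumptions above.
    public static int MaxRecordBytes => Align8(FixedHeaderBytes + MaxFileNameBytes);

    // Worst-case output size if every record carried a maximum-length name.
    public static long WorstCaseBufferBytes(long recordCount)
        => NextIdBytes + checked(recordCount * MaxRecordBytes);
}
```

Under these assumptions each record can take up to 576 bytes, so a volume with a million file record segments would call for a buffer of roughly 550 MiB. The output buffer size passed to DeviceIoControl is also a 32-bit value, so a single-call approach cannot scale to very large volumes in any case.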

What the value is useful for, rather, is pre-allocating your own fixed-size data structures, which you'll surely need, prior to the FSCTL_ENUM_USN_DATA enumeration operation. Suppose you have your own value type that you'll create for each USN entry (a dummy struct, just as an example):

[StructLayout(LayoutKind.Sequential)]
public struct MFT_IX_REC
{
    public ushort seq;
    public ushort parent_ix_hi;
    public uint parent_ix;
};

Then, using the estimate from above, you can pre-allocate an array of these before the DeviceIoControl and never have to worry about resizing during the iteration.

var med = new MFT_ENUM_DATA { ... };
// ...

var rg_mftix = new MFT_IX_REC[c_mft_estimate];
// ... ready to go, without having to check whether the array needs resizing within the loop

for (int i=0; DeviceIoControl(h_vol, FSCTL.ENUM_USN_DATA, in med, out USN_RECORD usn, ...); i++)
{
    // etc..
    rg_mftix[i].parent_ix = (uint)usn.ParentId;
    // etc..
}

This elimination of the dynamic array-resizing, usually needed when you don't know the number of entries in advance, is a non-trivial performance benefit, because it avoids the expensive jumbo-sized memcpy operations required for copying the existing data from the old array to a new, larger one each time you resize.
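The pseudo-code loop above glosses over the layout of each output buffer: an 8-byte "next id" followed by several variable-length records. Below is a minimal sketch of walking one such buffer into a pre-sized list. It assumes the USN_RECORD_V2 layout (RecordLength at offset 0, FileReferenceNumber at offset 8, ParentFileReferenceNumber at offset 16) and a little-endian machine; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class UsnBufferWalker
{
    private const int NextIdBytes = 8;      // quadword that precedes the first record
    private const int MinRecordBytes = 24;  // enough to read the three fields used here

    // Appends (file reference, parent reference) pairs to sink and returns the
    // "next id" to feed into StartFileReferenceNumber on the following call.
    public static ulong Walk(byte[] buffer, int bytesReturned,
                             List<(ulong FileRef, ulong ParentRef)> sink)
    {
        if (buffer == null) throw new ArgumentNullException(nameof(buffer));
        if (sink == null) throw new ArgumentNullException(nameof(sink));
        if (bytesReturned < NextIdBytes || bytesReturned > buffer.Length)
            throw new ArgumentOutOfRangeException(nameof(bytesReturned));

        ulong nextId = BitConverter.ToUInt64(buffer, 0);
        int offset = NextIdBytes;

        while (offset + MinRecordBytes <= bytesReturned)
        {
            int recordLength = (int)BitConverter.ToUInt32(buffer, offset);

            // Stop on a malformed or truncated record instead of reading past the data.
            if (recordLength < MinRecordBytes || offset + recordLength > bytesReturned)
                break;

            ulong fileRef = BitConverter.ToUInt64(buffer, offset + 8);
            ulong parentRef = BitConverter.ToUInt64(buffer, offset + 16);
            sink.Add((fileRef, parentRef));

            offset += recordLength;  // RecordLength already includes alignment padding
        }
        return nextId;
    }
}
```

Creating the list as new List<(ulong FileRef, ulong ParentRef)>(capacity) with the estimate computed earlier gives the same no-resize benefit as the fixed array, while still tolerating an estimate that turns out slightly low.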

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow