Question

I started working on a simple C program to scan for image files on a USB disk (BMP, JPEG, etc.). I completed the header files that will contain the image metadata.

My questions are about scanning the USB drive. How will the program know when it reaches the end of the file? I am treating the USB drive like a file, and I plan to read the raw data bytes using fread.

FILE *usb_ptr = fopen(argv[1],"rb");
if(usb_ptr == NULL){
    printf("error opening USB Drive for reading");
    return 1;   // do not call fclose on a NULL pointer
}
  //I manually give the mount location, on fedora usb drives are mounted at
  //         /run/media/user1/USBDRIVE by default

struct header1 header1;
struct header2 header2;
struct colours colours;
int file_count=0;

fread(&header1,sizeof(header1),1,usb_ptr);
fread(&header2,sizeof(header2),1,usb_ptr);

After copying the first few bytes of the USB disk, I perform a check to see if we found a BMP file; if it's not a BMP, I scan the next few bytes, and so on.

 if (header1.signature == 0x4d42 && header1.data_offset == 54 ){
      int file_size = header1.file_size;
      file_count++;
  //there are more checks trying to keep this post short

1) I plan to repeat this process until I reach the end of the file. But how do I determine when usb_ptr is at the end (that is, when I have finished scanning the USB disk)?

2) I am pretty sure there will be "EOF" characters in the memory of the USB disk. How do I know for certain that I have reached the end of the disk and have not just read some random byte on it?

3) Should I go about this in a different way?

(The code above is not complete, just snippets. There is also another section where I copy an image found on the USB disk to my hard disk. The program is essentially meant to recover images from a drive, and I hope to add more file types later.)

thanks.


Solution

My comments, summarized:

  1. fread will do what it always does at the end of a "file" (in this case, the disk): it returns the number of complete items it read, not the number of bytes. If you read one 512-byte item per call, that is most likely 0.

  2. EOF is not a 'byte' value you should be looking for, rather, it indicates a state. Use feof to explicitly test, or just check the return value of fread.

  3. Currently you are checking each and every single byte. But the data is not stored in any random order! USB sticks store data in sectors, each one 512 bytes long: "Sectors are 512 bytes long, for compatibility with hard drives" (Wikipedia on USB flash drive).

  4. You cannot assume contiguous sectors belong to the same file, because of fragmentation. If a file is fragmented, there is no automatic way to merge the sectors in the correct order ... (Doing it manually is usually out of the question. I'd consider doing that only if the original file contains easily recognizable data such as plain text, and the contents are extremely important :) .)
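Points 1 and 2 can be sketched as follows. This is a minimal illustration, not the code from the question: the helper name count_sectors and the had_error flag are made up for the example. It relies only on the return value of fread to stop, then asks ferror why the loop stopped.

```c
#include <stdio.h>

#define SECTOR_SIZE 512

/* Hypothetical helper: reads the stream sector by sector and returns
 * how many complete sectors were read. Sets *had_error to 1 if the
 * loop stopped because of a read error rather than the end of the
 * stream. */
long count_sectors(FILE *fp, int *had_error)
{
    unsigned char sector[SECTOR_SIZE];
    long sectors = 0;

    /* fread returns the number of complete items read: 1 or 0 here.
     * No byte value inside the data can end this loop. */
    while (fread(sector, SECTOR_SIZE, 1, fp) == 1)
        sectors++;

    /* A short count only says "stopped"; ferror/feof say why. */
    *had_error = ferror(fp) ? 1 : 0;
    return sectors;
}
```

After the loop, feof(fp) is true when the end of the stream was reached; a trailing partial sector, if any, is not counted.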

You can read a sector -- 512 bytes -- and stop if you encounter EOF. If this sector starts with the two signature bytes for a BMP, you can inspect it further to verify it is a BMP header, and if so, you can use the BMP structure data to check whether the following sectors contain a valid BMP file. The only way to do so is:

  • the first sector contains all relevant BMP metrics: the data size gives the size of the original pixel data, and you should read that much extra data.
  • using the BMP file specifications, check if:
    • width times height times bytes per pixel (plus any scan-line padding) equals the total data size
    • data does not contain out-of-range values (not possible for 24 bit images, though)
    • data is aligned to a DWORD per scan-line
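The size checks in this list could look roughly like the sketch below. It assumes the most common layout only (14-byte file header followed by a 40-byte info header, uncompressed, 24 bits per pixel), and the function name looks_like_bmp24 is invented for the example. The comparison against the file size is deliberately strict, so a valid file with extra trailing bytes would be rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Little-endian readers, so the check does not depend on struct
 * packing or on the byte order of the host. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical check: returns 1 if the start of `sector` looks like an
 * uncompressed 24-bit BMP whose sizes agree with each other. */
int looks_like_bmp24(const unsigned char *sector, size_t len)
{
    if (len < 54 || sector[0] != 'B' || sector[1] != 'M')
        return 0;

    uint32_t file_size   = le32(sector + 2);
    uint32_t data_offset = le32(sector + 10);
    uint32_t dib_size    = le32(sector + 14);
    int32_t  width       = (int32_t)le32(sector + 18);
    int32_t  height      = (int32_t)le32(sector + 22);
    uint16_t bpp         = le16(sector + 28);
    uint32_t compression = le32(sector + 30);

    if (data_offset != 54 || dib_size != 40 || bpp != 24 || compression != 0)
        return 0;
    if (width <= 0 || height == 0)
        return 0;

    /* A negative height means a top-down image; use its magnitude. */
    uint64_t rows = (uint64_t)(height < 0 ? -(int64_t)height : (int64_t)height);
    /* Each scan-line is padded to a multiple of 4 bytes (a DWORD). */
    uint64_t stride = (((uint64_t)width * bpp + 31) / 32) * 4;

    return (uint64_t)file_size == (uint64_t)data_offset + stride * rows;
}
```

A header describing a 1024 by 768 image with a file size of 54 + 1024 * 768 * 3 bytes passes; a sector that merely begins with the text "BMI, body mass index" fails at the offset check.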

If you accept the BMP as 'possibly correct', you can save it to disk and verify by eye if it seems correct. Then:

  • you are 100% sure this file is well-formed; or
  • another image may start "inside" this one's data part due to fragmentation.

If it isn't a well-formed BMP image, or you want a thorough check of every sector, continue scanning with the next sector. If you are sure the image is well-formed throughout or you want to speed up scanning, you can skip (datasize+sectorsize-1)/sectorsize sectors.
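The skip count above is an integer ceiling division. A small sketch, with the helper name sectors_spanned made up for the example and a fixed 512-byte sector assumed:

```c
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Number of sectors needed to hold data_size bytes, rounded up. */
uint64_t sectors_spanned(uint64_t data_size)
{
    return (data_size + SECTOR_SIZE - 1) / SECTOR_SIZE;
}
```

For a file of 2359350 bytes this gives 4609 sectors. Whether the sector that has already been read should be subtracted from that count depends on how the scanning loop keeps track of its position.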

The simple C program below scans an entire disk and, whenever a sector seems to be the start of a BMP file, prints its first 32 bytes in human-readable form. For my test disk, it gave the following output:

42 4D D8 49 EE 0E E8 B9 7A BE F3 7C DF FD 7E F7 77 9F 7B FF 38 7F F0 3C 24 33 B3 66 AD 77 BD 6B | BM.I....z..|..~.w.{.8..<$3.f.w.k
42 4D 6E E6 E3 D3 48 37 A5 27 D7 6F EF 49 4E 13 E0 A7 DF 78 47 8E 5E 3C 95 B5 0A 16 D2 5C CE 3A | BMn...H7.'.o.IN....xG.^<.....\.:
42 4D 36 00 24 00 00 00 00 00 36 00 00 00 28 00 00 00 00 04 00 00 00 03 00 00 01 00 18 00 00 00 | BM6.$.....6...(.................
42 4D 49 2C 20 62 6F 64 79 20 6D 61 73 73 20 69 6E 64 65 78 3B 20 41 53 41 2C 20 41 6D 65 72 69 | BMI, body mass index; ASA, Ameri
42 4D 50 66 6F 67 6C 65 00 00 00 00 00 00 29 1E 00 01 DC F8 BC 84 91 AE BC 84 91 AE 00 04 00 00 | BMPfogle......).................

The weird thing is, initially the disk contained no BMP files, so I copied one to it to test with. Now how come there is more than one candidate? (There were actually 9 more.) First, there are "false positives" (the "BMI" one is a nice example). Second, if there is a deleted BMP file somewhere on that disk and its first sector happens not to have been overwritten, it will also be listed!

Short & rough sample code:

#include <stdio.h>

int main (int argc, char **argv)
{
    FILE *usb_ptr;
    unsigned char buffer[512];
    int i, j;

    if (argc == 1)
    {
        printf ("wot no stick?\n");
        return -1;
    }
    usb_ptr = fopen(argv[1],"rb");
    if(usb_ptr == NULL)
    {
        printf("error opening USB Drive for reading\n");
        return -1;  /* without this, fread would be called on NULL */
    }

    i = 0;
    while (1)
    {
        if (fread (buffer, 512,1, usb_ptr) < 1)
            break;
        i++;
        if (!(i & 127))
            printf ("%d sectors read..\r", i);
        if (buffer[0] == 'B' && buffer[1] == 'M')
        {
            for (j=0; j<32; j++)
                printf ("%02X ", buffer[j]);
            printf ("| ");
            for (j=0; j<32; j++)
            {
                if (buffer[j] >= ' ' && buffer[j] <= '~')
                    printf ("%c", buffer[j]);
                else
                    printf (".");
            }
            printf ("\n");
        }
    }

    fclose (usb_ptr);

    return 0;
}

(Afterthought) It's pretty slow for a 1 GB disk ... perhaps it's faster to read more sectors at once. (Testing..) Yup, way faster to read even as few as 10 sectors inside the loop.
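A rough sketch of that batched variant is below. The batch size of 64 sectors and the function name count_bm_sectors are arbitrary choices for the example, and the speed gain will vary by system. It only counts candidate sectors instead of printing them, to keep the sketch short.

```c
#include <stdio.h>

#define SECTOR_SIZE 512
#define SECTORS_PER_READ 64   /* assumption: tune for your system */

/* Counts sectors that start with "BM", reading many sectors per fread. */
long count_bm_sectors(FILE *fp)
{
    static unsigned char buffer[SECTOR_SIZE * SECTORS_PER_READ];
    long hits = 0;
    size_t got, s;

    /* fread returns the number of complete sectors read, so the last,
     * shorter batch is still processed before the loop ends. */
    while ((got = fread(buffer, SECTOR_SIZE, SECTORS_PER_READ, fp)) > 0) {
        for (s = 0; s < got; s++) {
            const unsigned char *sector = buffer + s * SECTOR_SIZE;
            if (sector[0] == 'B' && sector[1] == 'M')
                hits++;
        }
    }
    return hits;
}
```

Only the first two bytes of each sector are tested, so a "BM" in the middle of a sector is ignored, just as in the single-sector loop.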

OTHER TIPS

A partial answer:

EOF is not a valid character. There are never any EOF characters inside a file, or on the disk. EOF is a value some functions return when you get to the end of the file. getchar, for example, returns an int and not a char for this reason: so that it can return EOF, a negative value (typically -1) that cannot be confused with any byte value read from the file. See here for more info.
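A small sketch of why the int return type matters, using fgetc (the stream version of getchar). The function name count_bytes is made up for the example. Because the result is kept in an int, a data byte with the value 0xFF does not end the loop.

```c
#include <stdio.h>

/* Counts the bytes in a stream with fgetc. The result is kept in an
 * int so that a data byte 0xFF (255) stays distinct from EOF. */
long count_bytes(FILE *fp)
{
    long n = 0;
    int c;   /* int, not char */

    while ((c = fgetc(fp)) != EOF)
        n++;
    return n;
}
```

If c were declared as a char instead, a 0xFF byte could compare equal to EOF on platforms where char is signed, and the loop would stop early.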

Licensed under: CC-BY-SA with attribution