Question

I need to read through a very large amount of binary data from a file. I have a fixed record size (38) and would like to skip several records at a time. I have tried doing this using FileStream.Position or Seek, but it seems that repositioning takes a noticeable amount of time too, so even if I skip 10 records at a time I do not read through the file 10 times faster.

Here is an SSCCE.

MODERATOR'S NOTE: This is not a repeat question; it is a follow-on I have extracted from another question so that a different focus can be explored.

You will need to create two buttons, Serialize and Deserialize.

Serialize creates a dummy data file.

Deserialize reads through it.

Comment out the fs.Position line to see a raw read through of the entire file; it takes 12 secs on my machine. Then uncomment it and the code will skip 10 records each time. I was hoping for a factor-of-10 improvement in speed, BUT it takes 8 secs on my machine, so I assume changing fs.Position is expensive.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using ProtoBuf;
using System.IO;
using System.Diagnostics;

namespace BinTest3
{


    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Serialize_Click(object sender, EventArgs e)
        {

            FileStream outBin = null;

            string binFileName = @"C:\binfile.dft";
            outBin = File.Create(binFileName, 2048, FileOptions.None);

            DateTime d = DateTime.Now;

            TickRecord tr = new TickRecord(d, 1.02, 1.03, 200, 300);

            for (int i = 0; i < 20000000; i++)
            {
                tr.BidPrice += 1;
                Serializer.SerializeWithLengthPrefix(outBin, tr, PrefixStyle.Base128);
            }

            outBin.Close();
            label1.Text = "Done ";
        }

        private void Deserialize_Click(object sender, EventArgs e)
        {
            Stopwatch sw = new Stopwatch();
            sw.Start();

            FileStream fs;
            string binFileName = @"C:\binfile.dft";

            fs = new FileStream(binFileName, FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 4096);
            long skipRate = 10;
            int count = 0;
            TickRecord tr;

            long skip = 38 * skipRate;  // record size (38 bytes) * number of records to skip
            try
            {
                while ((tr = Serializer.DeserializeWithLengthPrefix<TickRecord>(fs, PrefixStyle.Base128)) != null) //fs.Length > fs.Position)
                {
                    count++;

                    fs.Position += skip;  //Comment out this line to see raw speed

                }
            }
            catch (Exception)
            {
                // ignore the end-of-file overshoot caused by the final skip
            }

            fs.Close();

            sw.Stop();
            label1.Text = "Time taken: " + sw.Elapsed + " Count: " + count.ToString("n0");

        }
    }


    [ProtoContract]
    public class TickRecord
    {

        [ProtoMember(1, DataFormat = DataFormat.FixedSize)]
        public DateTime DT;
        [ProtoMember(2)]
        public double BidPrice;
        [ProtoMember(3)]
        public double AskPrice;
        [ProtoMember(4, DataFormat = DataFormat.FixedSize)]
        public int BidSize;
        [ProtoMember(5, DataFormat = DataFormat.FixedSize)]
        public int AskSize;

        public TickRecord()
        {

        }

        public TickRecord(DateTime DT, double BidPrice, double AskPrice, int BidSize, int AskSize)
        {
            this.DT = DT;
            this.BidPrice = BidPrice;
            this.AskPrice = AskPrice;
            this.BidSize = BidSize;
            this.AskSize = AskSize;

        }



    }
}

Solution

The disk cannot read a single byte any faster than it reads two bytes: it has to read large chunks at a time, so skipping over a handful of records won't actually change performance. You pay a fixed price for a single read up to some minimum amount of data, and that minimum varies from disk to disk. With 38-byte records, skipping 10 of them only advances the position by 380 bytes, still well within a typical 4 KB cluster, so largely the same data comes off the disk anyway.

What's more, there is significant overhead in calling the file APIs. If you only read a small amount at a time, you pay that overhead over and over again. It would be better to implement buffering in your code: read large chunks of data into memory and then resolve the actual record reads from memory. Probably the most efficient way to implement that is a memory-mapped file, as in the sketch below.
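For illustration, here is a minimal sketch of that idea using MemoryMappedFile. It assumes the fixed 38-byte record size mentioned in the question (records written with SerializeWithLengthPrefix are really length-prefixed, so treat that size as an assumption) and reuses the C:\binfile.dft path; the deserialization of each record is left as a placeholder.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class SkipReadSketch
{
    const int RecordSize = 38;   // assumed fixed record length from the question
    const int SkipRate = 10;     // touch every 10th record

    static void Main()
    {
        string binFileName = @"C:\binfile.dft";
        long fileLength = new FileInfo(binFileName).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(binFileName, FileMode.Open))
        using (var accessor = mmf.CreateViewAccessor(0, fileLength, MemoryMappedFileAccess.Read))
        {
            byte[] record = new byte[RecordSize];
            long count = 0;

            // Jump directly to every Nth record; the OS pages the file in
            // large blocks, so there are no per-record Seek/Position calls.
            for (long offset = 0; offset + RecordSize <= fileLength; offset += RecordSize * SkipRate)
            {
                accessor.ReadArray(offset, record, 0, RecordSize);
                // ... deserialize `record` here, e.g. with protobuf-net ...
                count++;
            }

            Console.WriteLine("Records touched: " + count.ToString("n0"));
        }
    }
}

If memory mapping is not an option, the same principle applies to a plain FileStream: read the file in multi-megabyte chunks and pick the records you want out of the in-memory buffer, rather than adjusting Position for every record.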

Licensed under: CC-BY-SA with attribution