Question

In my program I wrote the following logic for reading data from the database and storing it in a List<>:

                NpgsqlCommand cmd = new NpgsqlCommand(query, conn);
                List<UserInfo> result = new List<UserInfo>();
                Npgsql.NpgsqlDataReader rdr = cmd.ExecuteReader();
                while (rdr.Read())
                {
                    string userId = rdr[0].ToString();
                    string sex = rdr[1].ToString();
                    string strDateBirth = rdr[2].ToString();
                    string zip = rdr[3].ToString();

                    UserInfo userInfo = new UserInfo();
                    userInfo.Msisdn = userId;
                    userInfo.Gender = sex;
                    try
                    {
                        userInfo.BirthDate = Convert.ToDateTime(strDateBirth);
                    }
                    catch (Exception ex)
                    {
                    }
                    userInfo.ZipCode = zip;
                    userInfo.DemographicsKnown = true;
                    userInfo.AgeGroup = getAgeGroup(strDateBirth);
                    if (result.Count(x => x.Id== userId) == 0)
                        result.Add(userInfo);
                }

The performance of this code is really poor. There are over 2 million records, and after half an hour the result list contains just 300,000 of them.

Does anyone know how to speed up data reading from the database?


Solution

You are using .Count() when you really mean .Any().
Whenever you call .Count() with a predicate, you enumerate the entire collection just to see whether you have a single match.

Consider the question you're asking:
"How many rows do you have that match this condition? Is that number equal to zero?"

What you really mean is:
"Do any rows match this condition?"
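As a minimal, self-contained illustration of the difference (not from the original answer; the list contents here are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CountVsAny
{
    static void Main()
    {
        var seen = new List<string> { "111", "222", "333" };

        // Count(predicate) == 0 enumerates the ENTIRE list,
        // even though the match is found at the first element:
        bool isNewViaCount = seen.Count(x => x == "111") == 0;

        // Any(predicate) short-circuits on the first match:
        bool isNewViaAny = !seen.Any(x => x == "111");

        Console.WriteLine(isNewViaCount); // False
        Console.WriteLine(isNewViaAny);   // False
    }
}
```

Both expressions give the same answer; only the amount of work differs, and that difference compounds when the check runs once per row.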

In that context, you could create a HashSet<string> of the userId values. Checking for existence in a HashSet (or Dictionary) is much faster than doing the same check against a List.

Furthermore, if you already have the userId, then you parsed and read all the other column values for no reason. Check myHashSet.Contains(userId) first, then add.

This is the primary reason it's slow: for n rows you're performing on the order of the nth triangular number (n·(n+1)/2) of enumerations over the collection!

EDIT: Consider this untested change. I don't know whether your reader supports typed read methods like GetString(); if it doesn't, simply use what you had before.

NpgsqlCommand cmd = new NpgsqlCommand(query, conn);
List<UserInfo> result = new List<UserInfo>();
Npgsql.NpgsqlDataReader rdr = cmd.ExecuteReader();
HashSet<string> userHash = new HashSet<string>(); // is this actually an int?

while (rdr.Read())
{
    string userId = rdr.GetString(0);
    if (!userHash.Contains(userId))
    {
        string strDateBirth = rdr.GetString(2);
        UserInfo userInfo = new UserInfo();
        userInfo.Msisdn = userId;
        userInfo.Gender = rdr.GetString(1);
        DateTime parsedDate; // this is not used if the parse fails
        if (DateTime.TryParse(strDateBirth, out parsedDate))
        {
            userInfo.BirthDate = parsedDate;
            // userInfo.AgeGroup = getAgeGroup(strDateBirth); // why take the string?
            // rewrite your getAgeGroup method to take the DateTime
            userInfo.AgeGroup = getAgeGroup(parsedDate);
        }
        userInfo.ZipCode = rdr.GetString(3);
        userInfo.DemographicsKnown = true;
        result.Add(userInfo);
        userHash.Add(userId);
    }
}

This will always keep the first row you find for a given user (which is what your current code does). If you want to keep the last instance instead, you can use a Dictionary and eliminate the .Contains() call altogether.
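A sketch of that keep-last Dictionary variant (the minimal UserInfo class here is hypothetical; only the Msisdn field matters for the deduplication):

```csharp
using System.Collections.Generic;
using System.Linq;

class UserInfo
{
    public string Msisdn;
    public string Gender;
}

static class Dedup
{
    // Keeps the LAST row seen for each id: the Dictionary indexer
    // simply overwrites the earlier entry, so no Contains() check
    // (and no second hash lookup) is needed.
    public static List<UserInfo> KeepLast(IEnumerable<UserInfo> rows)
    {
        var byId = new Dictionary<string, UserInfo>();
        foreach (var u in rows)
            byId[u.Msisdn] = u;
        return byId.Values.ToList();
    }
}
```

Inside the reader loop this would be a single `byId[userId] = userInfo;` per row, with `byId.Values` materialized once after the loop.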

EDIT: I just noticed that my sample never added the userId to the hash... whoops... added it in there.

Other tips

All of that exception handling is slowing down your program a LOT. Exceptions are for exceptional cases; if your code is throwing more than 10 exceptions, you need to rethink your design.

Instead of throwing an exception every time there is a malformed date, use DateTime.TryParse(string, out DateTime) instead. It will speed up your code a lot.

////Replace this
//try
//{
//    userInfo.BirthDate = Convert.ToDateTime(strDateBirth);
//}
//catch (Exception ex)
//{
//}

//With this
DateTime birthDate;
if (DateTime.TryParse(strDateBirth, out birthDate))
{
    userInfo.BirthDate = birthDate;
}

Also, what is the data type of the column at rdr[2]? Is it already a DateTime? Another thing to do is stop calling ToString() on objects everywhere and use the correct typed getter methods.

while (rdr.Read())
{
    UserInfo userInfo = new UserInfo();
    userInfo.Msisdn = rdr.GetString(0);
    userInfo.Gender = rdr.GetString(1);

    DateTime? birthdate = null; // This is a nullable DateTime, see http://msdn.microsoft.com/en-us/library/b3h38hb0.aspx

    if (rdr.IsDBNull(2) == false)
    {
        birthdate = rdr.GetDateTime(2);
        userInfo.BirthDate = birthdate.Value;
    }
    userInfo.ZipCode = rdr.GetString(3);
    userInfo.DemographicsKnown = true;
    userInfo.AgeGroup = getAgeGroup(birthdate); // You may need to edit getAgeGroup to take a nullable DateTime

    if (!result.Any(x => x.Id == userInfo.Msisdn)) // Any() is much faster than Count() for this check; see Matthew PK's answer.
        result.Add(userInfo);
}

In order to speed up getting data from the database, you may want to consider a different way of reading the data instead of looping through the reader.

DataSet my_dataset = new DataSet();
NpgsqlCommand cmd = new NpgsqlCommand(query, conn);
NpgsqlDataAdapter my_dataadapter = new NpgsqlDataAdapter(cmd);
my_dataadapter.Fill(my_dataset, "mydataset");

Then do whatever you need with the DataSet.
You may be very surprised by the difference in speed.
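Once the DataSet is filled, the rows can be consumed without a reader. A hedged sketch, assuming the same table name as above and the same column order as the question (0 = id, 2 = birth date):

```csharp
using System;
using System.Collections.Generic;
using System.Data;

static class DataSetReader
{
    public static List<string> ReadUserIds(DataSet ds)
    {
        var ids = new List<string>();
        foreach (DataRow row in ds.Tables["mydataset"].Rows)
        {
            string userId = row[0].ToString();
            // DataRow.IsNull(i) replaces the reader's IsDBNull check:
            DateTime? birth = row.IsNull(2) ? (DateTime?)null : (DateTime)row[2];
            // ... build a UserInfo from userId/birth here, as before ...
            ids.Add(userId);
        }
        return ids;
    }
}
```

The trade-off: Fill() buffers the whole result set in memory up front, so for 2 million rows make sure that fits before switching.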

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow