Question

I have two CSV files (each several GB in size). I am trying to merge the two CSV files, but every time I try, my computer hangs. Is there no way to merge the files in chunks in pandas itself?


Solution

No, there is not. You will have to use an alternative tool such as Dask, Drill, Spark, or a good old-fashioned relational database.
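To illustrate the relational-database route: both files can be streamed into a SQLite table in chunks, so neither CSV is ever fully in memory, and the combined result streamed back out. This is a minimal sketch using tiny demo files; the file names, table name, and chunk size are placeholders (use a chunksize around 100,000 for real multi-GB files).

```python
import sqlite3
import pandas as pd

# Tiny demo inputs standing in for the multi-GB files (hypothetical names).
pd.DataFrame({"id": [1, 2], "val": ["a", "b"]}).to_csv("DataSet1.csv", index=False)
pd.DataFrame({"id": [3, 4], "val": ["c", "d"]}).to_csv("DataSet2.csv", index=False)

con = sqlite3.connect("merge.db")

# Stream both CSVs into one SQLite table, chunk by chunk.
for name in ("DataSet1.csv", "DataSet2.csv"):
    for chunk in pd.read_csv(name, chunksize=1):  # ~100_000 for real files
        chunk.to_sql("combined", con, if_exists="append", index=False)

# Stream the merged table back out to a single CSV, again in chunks.
with open("merged.csv", "w", newline="") as out:
    for i, chunk in enumerate(pd.read_sql("SELECT * FROM combined", con, chunksize=1)):
        chunk.to_csv(out, header=(i == 0), index=False)

con.close()
```

The same pattern works with any SQL database; SQLite is convenient here because it ships with Python's standard library.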

OTHER TIPS

When faced with such situations (loading and appending multi-GB CSV files), I found @user666's suggestion quite feasible: load one data set (e.g. DataSet1) as a pandas DataFrame and append the other (e.g. DataSet2) to it in chunks.

Here is the code I implemented:

import pandas as pd

# path1: directory containing the CSV file
# Collect the chunks in a list and concatenate once at the end;
# calling pd.concat inside the loop re-copies the growing frame
# on every iteration, which is much slower.
chunks = []
for chunk in pd.read_csv(path1 + 'DataSet1.csv', chunksize=100000, low_memory=False):
    chunks.append(chunk)
amgPd = pd.concat(chunks, ignore_index=True)
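The snippet above only streams DataSet1; DataSet2 is appended the same way, with a second chunked read over the other file. A self-contained sketch using tiny demo files (file names and chunk size are placeholders; use a chunksize around 100,000 for real multi-GB files):

```python
import pandas as pd

# Tiny demo inputs standing in for the multi-GB files (hypothetical names).
pd.DataFrame({"x": [1, 2]}).to_csv("DataSet1.csv", index=False)
pd.DataFrame({"x": [3, 4]}).to_csv("DataSet2.csv", index=False)

# Read both files chunk by chunk, collect the chunks,
# and concatenate once at the end.
chunks = []
for name in ("DataSet1.csv", "DataSet2.csv"):
    for chunk in pd.read_csv(name, chunksize=1):  # ~100_000 for real files
        chunks.append(chunk)
amgPd = pd.concat(chunks, ignore_index=True)
```

Note that the final pd.concat still has to hold the full result in memory, so this only helps when the combined data fits in RAM; otherwise one of the out-of-core tools from the accepted answer is the way to go.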
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange