Question

I'm working on a relatively large (5,000,000 rows and growing) set of time series data in an HDF5 table. I need a way to remove duplicates from it on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write the duplicates in during retrieval than to ensure none go in.

What is the best way to remove duplicates from a PyTables table? All of my reading points me towards importing the whole table into pandas, getting a unique-valued data frame, and writing it back to disk by recreating the table on each run. This seems counter to the point of PyTables, though, and in time I don't know that the whole data set will fit efficiently in memory. I should add that it is two columns together that define a unique record.

No reproducible code, but can anyone give me PyTables data-management advice?

Big thanks in advance...

Solution

See this related question: finding a duplicate in a hdf5 pytable with 500e6 rows

Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates; handling them is the user's responsibility.
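
For the straightforward case, here is a minimal sketch of the in-memory round trip described in the question. The column names 'timestamp' and 'series_id' are hypothetical stand-ins for the two columns that define a unique record, as are the file name data.h5 and the key 'ts':

    import pandas as pd

    # Read the whole table, drop duplicate (timestamp, series_id) pairs,
    # and write the result back, recreating the table on disk.
    with pd.HDFStore('data.h5') as store:
        df = store['ts']
        deduped = df.drop_duplicates(subset=['timestamp', 'series_id'])
        store.put('ts', deduped, format='table')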

You can also try this: merging two tables with millions of rows in python, where the merge function used is simply drop_duplicates().
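
The same idea can be chunked so the full table never has to sit in memory at once; only the set of already-seen key pairs does (which, at 5e6 rows, comfortably fits). A sketch under the same hypothetical layout as above, assuming 'ts' was written in table format so it can be read in chunks:

    import pandas as pd

    seen = set()
    with pd.HDFStore('data.h5') as store:
        for chunk in store.select('ts', chunksize=500_000):
            # Drop duplicates inside the chunk first ...
            chunk = chunk.drop_duplicates(subset=['timestamp', 'series_id'])
            # ... then drop rows whose key pair appeared in an earlier chunk.
            keys = list(zip(chunk['timestamp'], chunk['series_id']))
            mask = [k not in seen for k in keys]
            seen.update(keys)
            if any(mask):
                store.append('ts_unique', chunk[mask], format='table')

Once the loop finishes, 'ts_unique' holds one row per key pair and can replace 'ts'.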

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow