This seems like a task for which DataFrames should be the right tool. I'm not sure how well it currently works, but the documentation suggests that there is a join() method that can do both inner and outer joins, as you request. I have seen some issues about making DataFrames more like an in-memory database, but I have not followed the discussion closely enough to know where that stands.
When the problem size gets really big, I would suggest that you consider using a relational database such as MySQL or SQLite. They are carefully tuned for exactly these kinds of operations, and they give you a simple declarative language for expressing the result you want, leaving the system to work out the fastest way to compute it.
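
To give a feel for the declarative style, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The tables (people, orders), their columns, and the sample rows are made up for illustration and have nothing to do with your actual data. It shows an inner join and a left outer join; I use LEFT JOIN for the outer case because older SQLite versions do not support FULL OUTER JOIN.

```python
import sqlite3


def join_demo():
    """Run an inner and a left outer join on two tiny, hypothetical tables."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()

    # Hypothetical schema: people, and the orders they placed.
    cur.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE orders (person_id INTEGER, item TEXT)")
    cur.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [(1, "Ann"), (2, "Bob"), (3, "Cy")],
    )
    cur.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, "book"), (1, "pen"), (3, "lamp")],
    )

    # Inner join: only people who have at least one order.
    inner = cur.execute(
        "SELECT p.name, o.item FROM people p "
        "JOIN orders o ON o.person_id = p.id "
        "ORDER BY p.id, o.item"
    ).fetchall()

    # Left outer join: every person; item is NULL (None) when there is no order.
    left = cur.execute(
        "SELECT p.name, o.item FROM people p "
        "LEFT JOIN orders o ON o.person_id = p.id "
        "ORDER BY p.id, o.item"
    ).fetchall()

    conn.close()
    return inner, left


if __name__ == "__main__":
    inner_rows, left_rows = join_demo()
    print("inner:", inner_rows)
    print("left: ", left_rows)
```

Note that the queries only say which rows belong together; how to match them up (scan, index lookup, and so on) is left to the database. For large tables you would normally also create an index on the join column, here orders.person_id.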