Question

Is there a way to keep a variable (a large table / data frame) in memory and share it across multiple IPython notebooks?

I'm looking for something conceptually similar to MATLAB's persistent variables: there, a custom function / library can be called from multiple individual editors (notebooks) and have that external function cache some result (or large table).

Mostly I would like to avoid reloading a heavily used table (which is loaded through a custom library called from the notebooks), since reading it takes around 2-3 minutes whenever I start a new analysis.
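
For reference, within a single kernel a module-level cache in the custom loader already avoids repeat loads; the problem is that each notebook runs its own kernel. A minimal sketch of that per-kernel pattern (the module name, file path, and loader function are hypothetical):

```python
# my_data_lib.py -- hypothetical custom loader module
import pandas as pd

_cache = {}  # module-level cache; persists only for the lifetime of one kernel


def load_big_table(path="data/big_table.csv"):
    """Return the table, reading it from disk only on the first call in this kernel."""
    if path not in _cache:
        _cache[path] = pd.read_csv(path)  # the slow 2-3 minute read
    return _cache[path]
```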

Solution

If it's important for your use case, you could try switching to Apache Zeppelin, where all Spark notebooks can share the same Spark context and the same Python running environment: https://zeppelin.apache.org/

So what you're asking for happens natively in Zeppelin. To be precise, it is an option to share the same Spark context / Python environment between all Spark notebooks (they're called 'notes' in Zeppelin):

[Screenshot: Spark interpreter sharing options in Zeppelin]

So you can choose to share the context Globally (Zeppelin's default behavior), Per Note (the only behavior possible in Jupyter), or Per User.
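
As an illustration, with a globally shared Spark interpreter all notes run in the same process, so a dataframe loaded in one note can be reused from another through the shared SparkSession. A minimal sketch (the file path, view name, and filter are hypothetical):

```python
# --- Note A: load the table once and register it on the shared SparkSession ---
df = spark.read.parquet("/data/big_table.parquet")   # hypothetical path
df.createOrReplaceTempView("big_table")

# --- Note B: same interpreter process, so the view is already there;
#     no second read from disk is needed ---
subset = spark.table("big_table").where("year = 2016")  # hypothetical filter
subset.show(5)
```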

If you can't or don't want to switch to Zeppelin, look at other options for sharing common dataframes between your notebooks; one common approach is sketched below.
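
As one illustration (not from the original answer), a notebook-agnostic workaround is to cache the expensive table to a fast columnar file once and have every notebook read that cache, so the reload takes seconds instead of minutes. The paths and the loader functions here are hypothetical:

```python
# cached_loader.py -- hypothetical shared helper importable from any notebook
from pathlib import Path
import pandas as pd

CACHE = Path("/tmp/big_table.parquet")  # hypothetical cache location


def load_big_table():
    """Load the table from the Parquet cache, rebuilding the cache if needed."""
    if CACHE.exists():
        return pd.read_parquet(CACHE)     # fast columnar read (needs pyarrow)
    df = expensive_load_from_source()     # the original 2-3 minute load
    df.to_parquet(CACHE)                  # write the cache for other notebooks
    return df


def expensive_load_from_source():
    # placeholder for the custom library's slow read
    return pd.read_csv("/data/big_table.csv")
```

For smaller objects, IPython's built-in %store magic (the storemagic extension) offers a similar pickle-to-disk hand-off between notebooks.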

P.S. You currently can't import .ipynb files into Zeppelin (it has its own notebook format, stored as a JSON file) until https://issues.apache.org/jira/browse/ZEPPELIN-1793 is implemented, although it's not that hard to convert them manually in most cases.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange