Question

I know that pandas works "under the hood" with numpy arrays stored in dictionaries. In contrast, Koalas works with the underlying Spark framework. Does that mean that there is no extra cost associated with switching back and forth between Koalas and PySpark dataframes?

# convert to PySpark dataframe
df.to_spark()

# convert to Koalas dataframe
koalas_df = ks.DataFrame(df)

Edit: By cost I mean: does ks.DataFrame(df) create additional overhead? For example, toPandas() collects all records of the DataFrame to the driver program, so we can only call toPandas() on a small subset of the data.

Since I am switching between Koalas and Spark, I am wondering if there is any such overhead, or if Koalas "interprets" Spark dataframes without collecting records on the driver. At the moment I am working with a small subset of the data, but I am interested in any drawbacks when using larger amounts.


Solution

As you said, since Koalas is aimed at processing big data, there is no overhead like collecting data into a single partition when calling ks.DataFrame(df).

However, some overhead is incurred when creating the default columns for the _InternalFrame, which internally manages the metadata between pandas and PySpark.

Koalas internally uses an immutable frame named _InternalFrame, so you can refer to /koalas/databricks/koalas/internal.py if you want more detail.

Here is a short code excerpt from internal.py that creates a default index when the given Spark DataFrame has no index information.

https://github.com/databricks/koalas/blob/a42af49c55c3b4cc39c62463c0bed186e7ff9f08/databricks/koalas/internal.py#L478-L491

        if index_map is None:
            assert not any(SPARK_INDEX_NAME_PATTERN.match(name) for name in spark_frame.columns), (
                "Index columns should not appear in columns of the Spark DataFrame. Avoid "
                "index column names [%s]." % SPARK_INDEX_NAME_PATTERN
            )


            # Create default index.
            spark_frame = _InternalFrame.attach_default_index(spark_frame)
            index_map = OrderedDict({SPARK_DEFAULT_INDEX_NAME: None})


        if NATURAL_ORDER_COLUMN_NAME not in spark_frame.columns:
            spark_frame = spark_frame.withColumn(
                NATURAL_ORDER_COLUMN_NAME, F.monotonically_increasing_id()
            )
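To see why this step is cheap, here is a deliberately simplified pure-Python sketch of the idea (the class and names are invented for illustration; this is not the actual Koalas code): attaching the default index only manipulates column metadata, it never collects rows to the driver.

```python
from collections import OrderedDict

SPARK_DEFAULT_INDEX_NAME = "__index_level_0__"

class InternalFrameSketch:
    """Illustrative stand-in for Koalas' _InternalFrame: it holds the
    column list plus index metadata. When no index_map is given, it
    registers a default index column -- a metadata-only operation."""

    def __init__(self, columns, index_map=None):
        if index_map is None:
            # Analogous to _InternalFrame.attach_default_index(...):
            # register a default index column instead of touching rows.
            columns = [SPARK_DEFAULT_INDEX_NAME] + list(columns)
            index_map = OrderedDict({SPARK_DEFAULT_INDEX_NAME: None})
        self.columns = columns
        self.index_map = index_map

frame = InternalFrameSketch(["price", "qty"])
print(frame.columns)  # ['__index_level_0__', 'price', 'qty']
```

In the real implementation the extra column is filled with F.monotonically_increasing_id(), which is computed lazily and distributed, so conversion cost stays independent of the data size.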

OTHER TIPS

I believe Koalas is Databricks' equivalent of a pandas DataFrame built on top of a Spark DataFrame (Koalas is very, very new; it was released just a few months ago). I don't know what you mean by cost, but you can easily switch between Spark, Koalas, and pandas DataFrames. See the examples below.

# Convert a Koalas DataFrame to a Spark DataFrame
sdf = kdf.to_spark()

# Create a Spark DataFrame from a pandas DataFrame
sdf = spark.createDataFrame(pdf)

# Convert a Spark DataFrame to a pandas DataFrame
pdf = sdf.toPandas()

If you are asking how much you will be billed for the compute time, it's just pennies, really. Personally, I think some things are a lot easier to do in Python than in other languages. So if you have a DataFrame in some other form and you want to convert it into a pandas DataFrame to do a merge, or whatever, just make the conversion and do the merge.

result = pd.concat([df1, df2], axis=1)
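As a minimal self-contained illustration of that concat (the toy data and column names are invented):

```python
import pandas as pd

# Hypothetical toy frames just to demonstrate the merge
df1 = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
df2 = pd.DataFrame({"y": [0.5, 0.7]})

# axis=1 concatenates side by side, aligning rows on the index
result = pd.concat([df1, df2], axis=1)
print(result.shape)  # (2, 3)
```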

Does that make sense?

A little late to the party on this one, but here is some code and output (run on my moderate laptop):

from datetime import datetime

startTime = datetime.now()
spark_medium = koalas_medium.to_spark()
spark_medium.show()
print('Convert Medium Size DF to spark: (3 x 10,000): '
      + str(datetime.now() - startTime))

startTime = datetime.now()
spark_large = koalas_large.to_spark()
spark_large.show()
print('Convert Large Size DF to spark: (23 x Millions of rows): '
      + str(datetime.now() - startTime))

Output:

  • Convert Medium Size DF to spark: (3 x 10,000): 0:00:00.689036

  • Convert Large Size DF to spark: (23 x Millions of rows): 0:00:00.078039

Why it's slower to convert the medium DF I can't tell you, but running it several times the results were always roughly the same. (One plausible explanation: to_spark() is lazy and show() only fetches the first 20 rows, so the measured time is dominated by one-time warm-up cost rather than by data size.) Hope this helps someone.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange