Question

Let's assume I have the following dataframe in PySpark:

Customer    |  product  |   rating
customer1   |  product1 |   0.2343
customer1   |  product2 |   0.4440
customer2   |  product3 |   0.3123
customer3   |  product1 |   0.7430

There can be several customer–product combinations, but every combination is already unique. I want to achieve the following outcome in the most efficient manner:

Customer (Index) | product1  | product2  | product3
customer1        |   0.2343  |  0.4440   |  0.0000
customer2        |   0.0000  |  0.0000   |  0.3123
customer3        |   0.7430  |  0.0000   |  0.0000

Every combination that is not present in the first table should be set to zero. It has to be efficient, because the output matrix will be 59578 rows × 21521 columns, and I want to keep the computational cost as low as possible.

Is there any solution for this? I haven't found a good one on the web so far.

Thanks in advance for your help.

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange