Question

I want to analyze a large dataset (2,000,000 records, 20,000 customer IDs, 6 nominal attributes) using the Generalized Sequential Pattern algorithm.

This requires all attributes, aside from the time and customer ID attributes, to be binominal. Since I have 6 nominal attributes that I want to analyze for patterns, I need to transform them into binominal attributes using the "Nominal to Binominal" operator. This is causing memory problems on my workstation (16 GB RAM, of which I allocated 12 GB to the Java instance running RapidMiner).

Ideally I would like to set up my project so that it writes intermediate results to disk or to temporary tables in my Oracle database, from which my model would also read the data directly. In order to use the "Write Database" or "Update Database" operator, I need an existing table in my database that already has the boolean columns (if I'm not mistaken).

I tried writing the results of the binominal conversion step by step into CSV files on my local disk. I started with the nominal attribute that has the fewest distinct values, which resulted in a CSV file containing my dataset ID and 7 binominal attributes. I was seriously surprised to see that the file size was already >200 MB. This is caused by RapidMiner writing the binominal values as the strings "true"/"false". Wouldn't it be far more memory efficient to just write 0/1?

Is there a way either to use the Oracle database directly or to work with 0/1 values instead of "true"/"false"? My next column has 3,000 distinct values to be transformed, which would end in a nightmare...

I'd highly appreciate recommendations on how to use memory more efficiently or how to work directly in the database. If anyone knows how to easily transform a VARCHAR2 column in Oracle into one boolean column per distinct value, that would also be appreciated!

Thanks a lot, Holger

edit:

My goal is to get from such a structure:

column_a; column_b; customer_ID; timestamp
value_aa; value_ba; 1; 1
value_ab; value_ba; 1; 2
value_ab; value_bb; 1; 3

to this structure:

customer_ID; timestamp; column_a_value_aa; column_a_value_ab; column_b_value_ba; column_b_value_bb
1; 1; 1; 0; 1; 0
1; 2; 0; 1; 1; 0
1; 3; 0; 1; 0; 1

Solution

This answer is too long for a comment.

If you have thousands of levels for the six variables you are interested in, then you are unlikely to get useful results from that data. A typical approach is to group the values into broader categories before encoding, which results in fewer "binominal" variables. For instance, instead of "1 Gallon Whole Milk", you use "dairy products". This can also produce more actionable results. Remember that Oracle only allows 1,000 columns in a table, so the database imposes its own limit on how many indicator columns you can create.
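To make the effect of categorizing concrete, here is a minimal sketch in plain Python. The item names other than the milk example, the `CATEGORY_OF` mapping, and the helper names are all hypothetical; in practice the mapping would come from a product hierarchy or similar lookup table.

```python
# Hypothetical mapping from fine-grained items to coarse categories.
CATEGORY_OF = {
    "1 Gallon Whole Milk": "dairy products",
    "Cheddar Cheese": "dairy products",
    "Sourdough Loaf": "bakery",
}


def coarsen(values, mapping, default="other"):
    """Replace each raw value by its category; unmapped values fall into `default`."""
    return [mapping.get(value, default) for value in values]


def one_hot(values):
    """Return the sorted distinct levels and one 0/1 row per input value."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


items = ["1 Gallon Whole Milk", "Cheddar Cheese", "Sourdough Loaf", "1 Gallon Whole Milk"]

raw_levels, _ = one_hot(items)
coarse_levels, coarse_rows = one_hot(coarsen(items, CATEGORY_OF))

print(len(raw_levels), "indicator columns before categorizing")
print(len(coarse_levels), "indicator columns after categorizing:", coarse_levels)
```

The number of indicator columns equals the number of distinct levels, so collapsing 3,000 values into a few dozen categories shrinks the encoded table accordingly.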

If you are working with lots of individual items, then I would suggest other approaches, notably an approach based on association rules. This will not limit you by the number of variables.

Personally, I find that I can do much of this work in SQL, which is why I wrote a book on the topic ("Data Analysis Using SQL and Excel").
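As an illustration of doing the encoding in SQL, the sketch below generates a SELECT statement with one `CASE` expression per distinct value, using the column names from the question's example. The table name `events`, the helper names, and the way values are turned into column names are assumptions, and the list of levels would have to be fetched first (for example with `SELECT DISTINCT`). The 1,000-column check reflects the Oracle limit mentioned above.

```python
MAX_COLUMNS = 1000  # Oracle's per-table column limit, as noted in the answer


def column_name(attribute, value):
    """Build an indicator column name, replacing non-alphanumeric characters."""
    safe = "".join(ch if ch.isalnum() else "_" for ch in value)
    return f"{attribute}_{safe}"


def indicator_select(table, key_columns, levels_by_attribute):
    """Return a SELECT that emits one 0/1 column per (attribute, value) pair."""
    parts = list(key_columns)
    for attribute, levels in levels_by_attribute.items():
        for value in levels:
            literal = value.replace("'", "''")  # escape quotes inside the SQL literal
            parts.append(
                f"CASE WHEN {attribute} = '{literal}' THEN 1 ELSE 0 END "
                f"AS {column_name(attribute, value)}"
            )
    if len(parts) > MAX_COLUMNS:
        raise ValueError(f"{len(parts)} columns exceed the limit of {MAX_COLUMNS}")
    return "SELECT\n  " + ",\n  ".join(parts) + f"\nFROM {table}"


sql = indicator_select(
    "events",  # hypothetical table name
    ["customer_ID", "timestamp"],
    {"column_a": ["value_aa", "value_ab"], "column_b": ["value_ba", "value_bb"]},
)
print(sql)
```

The generated statement could be wrapped in a `CREATE TABLE ... AS` or a view so that the mining tool reads 0/1 columns directly from the database. Generated column names may still need shortening or quoting to satisfy the database's identifier rules.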

OTHER TIPS

You can use the Nominal to Numeric operator to convert true and false values to 1 and 0. Set the coding type parameter to unique integers.
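The effect on file size can be seen outside RapidMiner with a small sketch. This is plain Python rather than the operator itself, and the sample rows and function names are made up for illustration.

```python
import csv
import io


def csv_size(rows):
    """Return the size in bytes of the rows written as CSV."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return len(buffer.getvalue().encode("utf-8"))


def to_numeric(rows):
    """Replace the strings "true"/"false" by 1/0 and leave other cells unchanged."""
    mapping = {"true": 1, "false": 0}
    return [[mapping.get(cell, cell) for cell in row] for row in rows]


# Made-up sample: an ID column followed by two binominal attributes.
text_rows = [[1, "true", "false"], [2, "false", "false"], [3, "false", "true"]]
numeric_rows = to_numeric(text_rows)

print(csv_size(text_rows), "bytes with true/false")
print(csv_size(numeric_rows), "bytes with 1/0")
```

Each "true" or "false" cell takes four or five characters where a 0/1 cell takes one, so the saving grows with the number of binominal columns. This only measures the written file; memory use inside the tool may behave differently.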

Licensed under: CC-BY-SA with attribution