I want to analyze a large dataset (2,000,000 records, 20,000 customer IDs, 6 nominal attributes) using the Generalized Sequential Pattern algorithm.
This requires all attributes, aside from the time and customer ID attributes, to be binominal. Since I have 6 nominal attributes that I want to analyze for patterns, I need to transform them into binominal attributes using the "Nominal to Binominal" operator. This is causing memory problems on my workstation (16 GB RAM, of which I allocated 12 GB to the Java instance running RapidMiner).
Ideally I would like to set up my project so that it writes intermediate results to disk or to temporary tables in my Oracle database, from which my model also reads the data directly. To use the "Write Database" or "Update Database" operator, I need a table that already exists in my database and already has boolean columns (if I'm not mistaken).
I tried writing the results of the binominal conversion step by step to CSV files on my local disk. I started with the nominal attribute that has the fewest distinct values, which produced a CSV file containing my dataset ID and now 7 binominal attributes. I was seriously surprised to see that the file was already larger than 200 MB. This is caused by RapidMiner writing the binominal values as the strings "true"/"false". Wouldn't it be far more memory-efficient to just write 0/1?
Is there a way to either use the Oracle database directly or work with 0/1 values instead of "true"/"false"? My next column has 3000 distinct values to be transformed, which would end in a nightmare...
I'd highly appreciate recommendations on how to use memory more efficiently or how to work directly in the database. If anyone knows an easy way to transform a VARCHAR2 column in Oracle into one boolean column per distinct value, that would also be appreciated!
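For the Oracle side, the only idea I have so far is to generate a SELECT with one CASE expression per distinct value. Below is a rough, untested-against-my-data sketch in Python that just builds the SQL text. The function name, the table name my_table and the value list are made up for illustration; in practice the values would come from a SELECT DISTINCT on the column, and they would have to be usable as column aliases. As far as I know, Oracle has traditionally limited tables to 1000 columns, so the 3000-value attribute could not go into a single table this way.

```python
def one_hot_select(table, key_columns, attribute, values):
    """Build a SELECT that emits one 0/1 column per distinct value of `attribute`.

    Assumption: every value is safe to use inside a column alias
    (no spaces, short enough for the database's identifier limit).
    """
    parts = list(key_columns)
    for value in values:
        literal = value.replace("'", "''")  # escape single quotes for the SQL string literal
        alias = f"{attribute}_{value}"
        parts.append(
            f"CASE WHEN {attribute} = '{literal}' THEN 1 ELSE 0 END AS {alias}"
        )
    return "SELECT " + ",\n       ".join(parts) + f"\nFROM {table}"


# Hypothetical table and values, matching the small example in this post.
sql = one_hot_select(
    "my_table", ["customer_ID", "timestamp"], "column_a", ["value_aa", "value_ab"]
)
print(sql)
```

The generated statement could be wrapped in CREATE TABLE ... AS or a view, so the 0/1 columns never have to pass through RapidMiner's memory.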
Thanks a lot,
Holger
Edit:
My goal is to get from such a structure:
column_a; column_b; customer_ID; timestamp
value_aa; value_ba; 1; 1
value_ab; value_ba; 1; 2
value_ab; value_bb; 1; 3
to this structure:
customer_ID; timestamp; column_a_value_aa; column_a_value_ab; column_b_value_ba; column_b_value_bb
1; 1; 1; 0; 1; 0
1; 2; 0; 1; 1; 0
1; 3; 0; 1; 0; 1
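To make the target transformation concrete, here is a minimal sketch in plain Python (standard library only) that turns the first structure into the second, writing 0/1 instead of "true"/"false". It works row by row, so memory use should not grow with the number of records. It assumes the distinct values per column are known up front (for example from a first pass over the data); the function name and the in-memory sample are only for illustration, and a real run would read from and write to files.

```python
import csv
import io


def one_hot_rows(reader, key_columns, categories):
    """Yield a header row, then one 0/1-encoded row per input record.

    `categories` maps each nominal column to the ordered list of its
    distinct values. Records are handled one at a time.
    """
    header = list(key_columns)
    for column, values in categories.items():
        header += [f"{column}_{value}" for value in values]
    yield header
    for record in reader:
        row = [record[key] for key in key_columns]
        for column, values in categories.items():
            row += [1 if record[column] == value else 0 for value in values]
        yield row


# Sample data from the example above (without the spaces after the separators).
sample = """column_a;column_b;customer_ID;timestamp
value_aa;value_ba;1;1
value_ab;value_ba;1;2
value_ab;value_bb;1;3
"""

reader = csv.DictReader(io.StringIO(sample), delimiter=";")
categories = {
    "column_a": ["value_aa", "value_ab"],
    "column_b": ["value_ba", "value_bb"],
}

output = io.StringIO()
writer = csv.writer(output, delimiter=";", lineterminator="\n")
writer.writerows(one_hot_rows(reader, ["customer_ID", "timestamp"], categories))
print(output.getvalue())
```

Whether RapidMiner then accepts the 0/1 columns as binominal without another conversion is something I have not checked.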