Question

I am using Java to read and process some datasets from the UCI Machine Learning Repository. I started out by making a class for each dataset and working with that particular class file. Every attribute in the dataset was represented by a corresponding data member of the required type. This approach worked fine as long as the number of attributes stayed under 10-15: I just added or removed data members and changed their types to model new datasets, and made the matching changes to the functions.

The problem: I now have to work with much larger datasets. Ones with more than 20-30 attributes are very tedious to handle in this manner. I don't need to query the data; my discretization algorithm just needs 4 scans of the data to discretize it, and my work ends right after the discretization. What would be an effective strategy here?

I hope I have been able to state my problem clearly.

Solution

Create a simple DataSet class that contains members like the following:

 import java.io.File;
 import java.util.ArrayList;
 import java.util.List;

 public class DataSet {
     private List<Column> columns = new ArrayList<Column>();
     private List<Row> rows = new ArrayList<Row>();

     public void parse( File file ) {
         // routines to read CSV data into this class
     }
 }

 public class Row {
     private Object[] data;

     public void parse( String row, List<Column> columns ) {
         // split the raw CSV line, then convert each field using
         // the data type declared for its column
         String[] fields = row.split(",");
         data = new Object[fields.length];

         int i = 0;
         for( Column column : columns ) {
             data[i] = column.convert(fields[i]);
             i++;
         }
     }
 }

 public class Column {
     private String name;
     private int index;
     private DataType type;

     // convert a raw field to a typed value based on this column's type
     public Object convert( String data ) {
         if( type == DataType.NUMERIC ) {
             return Double.parseDouble( data );
         } else {
             return data;
         }
     }
 }

 public enum DataType {
     CATEGORICAL, NUMERIC
 }

That'll handle any data set you wish to use. The only catch is that the user must define the dataset by declaring the columns and their respective data types to the DataSet. You can do that in code or read it from a file, whichever you find easier. You might be able to default a lot of the configuration (say, to CATEGORICAL), or attempt to parse each field: if parsing fails the column must be CATEGORICAL, otherwise it's NUMERIC. Normally the file contains a header you can parse to find the column names; then you just need to figure out each data type by looking at the data in that column. A simple algorithm to guess the data type goes a long way. Essentially, this is the same data structure every other package uses for data like this (e.g. R, Weka, etc.).
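
To make the guessing idea concrete, a minimal helper could try to parse every value in a column as a number and fall back to CATEGORICAL on the first failure. The name guessType and the choice of passing the column's values as a List<String> are my own assumptions:

 // Hypothetical helper: a column is NUMERIC only if every value
 // in it parses as a number; otherwise it is CATEGORICAL.
 public static DataType guessType( List<String> values ) {
     for( String value : values ) {
         try {
             Double.parseDouble( value );
         } catch( NumberFormatException e ) {
             return DataType.CATEGORICAL;
         }
     }
     return DataType.NUMERIC;
 }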

Other tips

Some options:

  1. Write a code generator that reads the file's metadata and generates the equivalent class file.
  2. Don't bother with classes; keep the data in arrays of Object or String and cast the elements as needed (see the sketch after this list).
  3. Create a class that contains a collection of DataElements, subclass DataElement for each type you need, and use the metadata to create the right class at runtime.
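
To illustrate option 2, here is a minimal, self-contained sketch; the class name, the sample attribute values, and the positional casts are illustrative assumptions, not part of the original answer:

 import java.util.ArrayList;
 import java.util.List;

 public class ArrayApproach {
     public static void main( String[] args ) {
         // each row is just an Object[]; the caller casts by position
         List<Object[]> rows = new ArrayList<Object[]>();
         rows.add( new Object[] { "sunny", 25.3 } );

         for( Object[] row : rows ) {
             String outlook = (String) row[0];     // categorical attribute
             double temperature = (Double) row[1]; // numeric attribute
             System.out.println( outlook + " " + temperature );
         }
     }
 }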

I did something like that in one of my projects: lots of variable data, which in my case I obtained from the Internet. Since I needed to query, sort, etc., I spent some time designing a database to accommodate all the variations in the data (not all entries had the same number of properties). It took a while, but in the end I used the same code to get the data for any entry (using JPA in my case). My IDE (NetBeans) generated most of the code straight from the database schema.

From your question it is not clear how you plan to use the data, so I'm answering based on personal experience.

Licensed under: CC-BY-SA with attribution