Integrating Hive with Mahout for Recommendation

https://stackoverflow.com//questions/22050460

21-12-2019

Question

I want to use Mahout with Hive: fetch data from Hive, populate a Mahout data model with it, and use Mahout for recommendations. Is this possible? From what I have seen, Mahout only works with files. 1) How do I load data from a Hive table into Mahout? 2) Is there any other way to use Mahout recommendations with Hive or other sources?

I have a Hive JDBC result and want to load it into a Mahout DataModel. How do I populate it?

I want to use a database result for Mahout recommendations instead of reading from a file. For example:

Hive:

    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveJdbcClient {
      private static String driverName = "org.apache.hive.jdbc.HiveDriver";

      /**
       * @param args
       * @throws SQLException
       */
      public static void main(String[] args) throws SQLException {
        try {
          Class.forName(driverName);
        } catch (ClassNotFoundException e) {
          // TODO Auto-generated catch block
          e.printStackTrace();
          System.exit(1);
        }
        //replace "hive" here with the name of the user the queries should run as
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.execute("drop table if exists " + tableName);
        stmt.execute("create table " + tableName + " (key int, value string)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        if (res.next()) {
          System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/a.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        stmt.execute(sql);

        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(res.getString(1));
        }

        // release JDBC resources
        res.close();
        stmt.close();
        con.close();
      }
    }

Mahout:

    // Create a data source from the CSV file
    File userPreferencesFile = new File("data/dataset1.csv");
    DataModel dataModel = new FileDataModel(userPreferencesFile);

    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
    UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(2, userSimilarity, dataModel);

    // Create a generic user-based recommender with the dataModel, the userNeighborhood and the userSimilarity
    Recommender genericRecommender = new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);

    // Recommend 5 items for each user
    for (LongPrimitiveIterator iterator = dataModel.getUserIDs(); iterator.hasNext();)
    {
        long userId = iterator.nextLong();

        // Generate a list of 5 recommendations for the user
        List<RecommendedItem> itemRecommendations = genericRecommender.recommend(userId, 5);

        System.out.format("User Id: %d%n", userId);

        if (itemRecommendations.isEmpty())
        {
            System.out.println("No recommendations for this user.");
        }
        else
        {
            // Display the list of recommendations
            for (RecommendedItem recommendedItem : itemRecommendations)
            {
                System.out.format("Recommended Item Id %d. Strength of the preference: %f%n", recommendedItem.getItemID(), recommendedItem.getValue());
            }
        }
    }

Solution

Mahout 0.9 provides data models (sources of input data) for JDBC-compliant databases such as MySQL, Oracle and PostgreSQL, for NoSQL databases such as MongoDB, HBase and Cassandra, and for the file system, as you mentioned.

As of this release, Hive is not a fully SQL-standard database, so the MySQLJDBCDataModel and SQL92JDBCDataModel data models are not appropriate for Hive tables: Hive's SQL syntax differs considerably from that of JDBC-compliant databases.

For your first question, you might extend AbstractJDBCDataModel and override the constructor to pass in the Hive DataSource along with Hive-specific SQL queries for preference, preference time, user, all users and so on, similar to the queries listed in the AbstractJDBCDataModel constructor.
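If subclassing AbstractJDBCDataModel is more than you need and the data fits on one machine, a simpler route (a different technique from the one described above) is to dump the Hive query result into a CSV file and keep using FileDataModel, which reads lines of the form userID,itemID,preference. The following is only a minimal sketch: the class HiveToCsvExporter and its methods are hypothetical helpers, not part of Mahout or Hive, and it assumes the query returns numeric user and item IDs in columns 1 and 2 and the preference value in column 3.

```java
import java.io.IOException;
import java.io.Writer;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, not part of Mahout or Hive.
class HiveToCsvExporter {

    // Formats one preference as a "userID,itemID,preference" line,
    // the comma-separated layout that FileDataModel reads.
    static String toCsvLine(long userId, long itemId, float preference) {
        return userId + "," + itemId + "," + preference;
    }

    // Writes every row of the result set as one CSV line and returns the row count.
    // Assumption: column 1 = user ID, column 2 = item ID, column 3 = preference value.
    static int export(ResultSet rows, Writer out) throws SQLException, IOException {
        int count = 0;
        while (rows.next()) {
            out.write(toCsvLine(rows.getLong(1), rows.getLong(2), rows.getFloat(3)));
            out.write('\n');
            count++;
        }
        return count;
    }
}
```

With the Statement from the question, you could call something like HiveToCsvExporter.export(stmt.executeQuery("select user_id, item_id, rating from ratings"), writer), where the table and column names are made up for illustration, and then pass the written file to new FileDataModel(file) as in your Mahout snippet.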

For your second question, the method above works if you are using the non-distributed (Taste) algorithms. If you use a distributed algorithm, Mahout can run on Hadoop and read the HDFS files that back the Hive table. See the Mahout documentation for details on running Mahout on Hadoop.
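For the distributed route, one practical detail is the file layout: Hive's default text tables use Ctrl-A as the field separator (as noted in your own sample code), while Mahout's text input is normally comma- or tab-delimited. One way to bridge this, sketched below, is to have Hive write the preferences to an HDFS directory as comma-delimited text. The class HiveHdfsExport is a hypothetical helper, the ROW FORMAT clause on INSERT OVERWRITE DIRECTORY assumes Hive 0.11 or later, and the strings are concatenated without escaping, so only trusted values should be passed in.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of Mahout or Hive.
class HiveHdfsExport {

    // Builds a HiveQL statement that writes the query result to an HDFS
    // directory as comma-delimited text, one row per line.
    // Assumption: Hive 0.11+ (ROW FORMAT on INSERT OVERWRITE DIRECTORY).
    static String buildExportSql(String hdfsDir, String selectSql) {
        return "INSERT OVERWRITE DIRECTORY '" + hdfsDir + "' "
             + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
             + selectSql;
    }

    // Runs the export over an existing Hive JDBC connection.
    static void export(Connection con, String hdfsDir, String selectSql) throws SQLException {
        try (Statement stmt = con.createStatement()) {
            stmt.execute(buildExportSql(hdfsDir, selectSql));
        }
    }
}
```

For example, HiveHdfsExport.export(con, "/tmp/mahout/input", "SELECT user_id, item_id, rating FROM ratings") would leave files in that directory (path, table and column names are illustrative) that a distributed Mahout job, such as the item-based recommender job, could then take as input.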

Licensed under: CC-BY-SA with attribution
