MapReduce - For each student, what is the hour during which the student has posted the most posts

StackOverflow https://stackoverflow.com/questions/23523911

  •  17-07-2023

Question

I have a dump of Stack Overflow records on Hadoop. I am wondering what a good way is to answer the question in the title: for each user, during which hour of the day do they post the most?

Sample record

<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="251" ViewCount="15207" Body="&lt;p&gt;I want to use a track-bar to change a form's opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to build it, I get this error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;  &lt;p&gt;Cannot implicitly convert type 'decimal' to 'double'.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried making &lt;strong&gt;trans&lt;/strong&gt; to &lt;strong&gt;double&lt;/strong&gt;, but then the control doesn't work. This code has worked fine for me in VB.NET in the past. &lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="2648239" LastEditorDisplayName="Rich B" LastEditDate="2014-01-03T02:42:54.963" LastActivityDate="2014-01-03T02:42:54.963" Title="When setting a form's opacity should I use a decimal or double?" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;type-conversion&gt;&lt;opacity&gt;" AnswerCount="13" CommentCount="25" FavoriteCount="23" CommunityOwnedDate="2012-10-31T16:42:47.213" />

My first cut

Key = userid_hour

With that key, a counting job gives me the number of posts per user per hour. I then need to post-process that output to pick the hour with the maximum count for each user, and then see which hours are the most active overall.
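As a minimal sketch of how the `userid_hour` key could be derived from one record, assuming the `OwnerUserId` and `CreationDate` attributes shown in the sample row (the function name `user_hour_key` is hypothetical, and this is plain Python rather than Hadoop code):

```python
import xml.etree.ElementTree as ET
from datetime import datetime

def user_hour_key(row_xml):
    """Return 'userid_hour' for one <row .../> record, or None if a field is missing."""
    row = ET.fromstring(row_xml)
    user = row.get("OwnerUserId")
    created = row.get("CreationDate")
    if user is None or created is None:
        return None  # e.g. posts without an owner cannot be attributed to a user
    # CreationDate in the dump looks like 2008-07-31T21:42:52.667
    hour = datetime.strptime(created, "%Y-%m-%dT%H:%M:%S.%f").hour
    return f"{user}_{hour:02d}"

# Shortened version of the sample record (only the attributes used here)
sample = '<row Id="4" PostTypeId="1" CreationDate="2008-07-31T21:42:52.667" OwnerUserId="8" />'
print(user_hour_key(sample))  # prints 8_21
```

Zero-padding the hour keeps the keys in hour order when they are sorted as strings.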

Question

What alternatives do we have to simplify this?


Solution

I think you've got it as simple as it can be.

The first job gives you a count of posts per user per hour:

  • Input: record
  • Intermediate: k=user+hour; v=1
  • Output: k=user+hour; v=count
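The first job is essentially a word count over (user, hour) keys. A rough in-memory sketch, assuming each record has already been reduced to a (user, hour) pair; `run_job` is a hypothetical stand-in for Hadoop's map, shuffle, and reduce phases, not a real API:

```python
from collections import defaultdict

def map_job1(record):
    """Emit ((user, hour), 1) for one post."""
    user, hour = record
    yield (user, hour), 1

def reduce_job1(key, values):
    """Sum the 1s for one (user, hour) key."""
    yield key, sum(values)

def run_job(records, mapper, reducer):
    """Tiny in-memory imitation of map -> shuffle/sort -> reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # the shuffle delivers keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

# Made-up input: user "8" posts twice at 21h and once at 9h, user "13" once at 9h
posts = [("8", 21), ("8", 21), ("8", 9), ("13", 9)]
counts = run_job(posts, map_job1, reduce_job1)
```

Because the reduce step is a plain sum, the same function could also serve as a combiner on the map side.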

A second job finds each user's most active hour. As @pangea notes, this involves a descending secondary sort on the count. Normally each reduce() call receives the values for a single, unique key. A grouping comparator lets you combine the values for several keys into one reduce() call. Here, it would tell Hadoop to group all composite keys for a given user together, so that all of that user's hourly counts arrive in a single call to the reducer, with the largest count first.

  • Input: k=user+hour; v=count
  • Intermediate: k=user+count; v=hour+count
  • Output: k=user; v=most-active-hour
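The effect of the secondary sort plus grouping comparator can be imitated in memory. In this sketch, assuming job 1 output in the form ((user, hour), count), the sort orders composite keys by user and then by count descending, and the grouping step looks only at the user part, so the reducer just takes the first value. All function names are hypothetical:

```python
from itertools import groupby

def map_job2(key, count):
    """Re-key job 1 output as k=(user, count), v=(hour, count)."""
    user, hour = key
    return (user, count), (hour, count)

def sort_key(pair):
    """Sort comparator: user ascending, count descending."""
    (user, count), _value = pair
    return (user, -count)

def reduce_job2(user, values):
    """Values arrive with the largest count first, so the first one wins."""
    hour, _count = next(iter(values))
    return user, hour

def run_job2(job1_output):
    intermediate = sorted((map_job2(k, v) for k, v in job1_output), key=sort_key)
    results = []
    # Grouping comparator: group on the user part of the composite key only
    for user, group in groupby(intermediate, key=lambda pair: pair[0][0]):
        results.append(reduce_job2(user, (value for _key, value in group)))
    return results

# Made-up job 1 output
job1_output = [(("13", 9), 1), (("8", 9), 1), (("8", 21), 2)]
most_active = run_job2(job1_output)
```

If two hours tie for a user, this sketch keeps whichever came first in the input; a real job would need an explicit tie-break rule if that matters.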

A third job counts, for each hour, the number of users whose most active hour is that hour.

  • Input: k=user; v=most-active-hour
  • Intermediate: k=hour; v=1
  • Output: k=hour; v=number-users-most-active-this-hour
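The third job is another counting job, this time keyed by hour. A short in-memory sketch, assuming job 2 output as (user, hour) pairs (function names are hypothetical):

```python
from collections import defaultdict

def map_job3(user, hour):
    """Drop the user and emit (hour, 1)."""
    return hour, 1

def run_job3(job2_output):
    """Return {hour: number of users whose most active hour is that hour}."""
    totals = defaultdict(int)
    for user, hour in job2_output:
        key, one = map_job3(user, hour)
        totals[key] += one  # reduce step: sum the 1s per hour
    return dict(totals)

# Made-up job 2 output
users_per_hour = run_job3([("8", 21), ("13", 9), ("42", 21)])
```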

You can force job 3 to use a single reducer. That lets you keep state in the reducer instance and sort and report the data in the cleanup() method instead of adding a fourth job. That kind of technique does not scale in general; it works here because you have at most 24 values to sort.
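To illustrate the single-reducer idea, here is a hedged sketch of a reducer that only accumulates state in reduce() and emits the sorted report from cleanup(). The class `HourReducer` is a hypothetical Python imitation of that lifecycle, not Hadoop's actual Reducer class:

```python
class HourReducer:
    """Imitates a single reducer instance that sees every hour key."""

    def __init__(self):
        self.totals = {}  # state kept across reduce() calls

    def reduce(self, hour, ones):
        # Emit nothing yet; just remember the total for this hour
        self.totals[hour] = sum(ones)

    def cleanup(self):
        # Runs once after the last reduce() call; at most 24 entries to sort
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)

reducer = HourReducer()
# Made-up grouped input: hour -> list of 1s, one per user
for hour, ones in {9: [1, 1, 1], 21: [1], 14: [1, 1]}.items():
    reducer.reduce(hour, ones)
report = reducer.cleanup()
```

This only works because one instance sees all keys; with several reducers each would sort just its own share of the hours.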

Other suggestions

You can create a composite key with two fields: userId and hour. You can then sort the keys by both fields and group them by userId. For each userId, the reducer iterates through the sorted list of hours, which makes it easy to find the hour with the maximum number of posts.
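A rough sketch of this single-pass alternative, assuming each post has been reduced to a (userId, hour) pair. Because a user's hours arrive sorted, equal hours are adjacent and can be counted as runs without holding them all in memory. The function names are hypothetical:

```python
from itertools import groupby

def most_active_hour(sorted_hours):
    """sorted_hours: one user's post hours in ascending order, one entry per post."""
    best_hour, best_count = None, 0
    for hour, run in groupby(sorted_hours):  # equal hours are adjacent
        count = sum(1 for _ in run)
        if count > best_count:  # on a tie, the earlier hour is kept
            best_hour, best_count = hour, count
    return best_hour

def run(posts):
    keys = sorted(posts)  # sort by (userId, hour)
    # Group on userId only, then scan that user's sorted hours
    return {
        user: most_active_hour(hour for _user, hour in group)
        for user, group in groupby(keys, key=lambda key: key[0])
    }

# Made-up input
result = run([(8, 21), (13, 9), (8, 9), (8, 21)])
```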

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow