Question

I have a dataset of virtual currency earn and spend events from a mobile game app. Unfortunately, people cheat in the game to get more currency. These cheaters use different techniques, so it's quite hard to detect them all in the game. The dataset is also about 50 TB, so the only option I have is to use SQL (on Google BigQuery).

I tried to build a standard outlier detection system in which I compute the average and standard deviation of the currency earned and spent in each level. This works for the biggest outliers. However, some people cheat to earn, for example, 1e15 gold, while others "only" cheat to get 10000 gold. The normal gold earn rate should not be higher than about 1000. The standard outlier detection catches the player who earned 1e15 gold, but because that player inflates the average and standard deviation so much, the 10000 gold is not flagged as an outlier.
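
To illustrate the masking effect described above, here is a minimal sketch that computes a z-score over the six gold quest earn amounts from the sample data in this post. The names gold_earn, stats and z_score are made up for the illustration, and the real system would compute the statistics per level rather than over one small group.

```sql
# Gold earned from quests, copied from the sample rows in this post.
WITH gold_earn AS (
  SELECT user_id, amount
  FROM UNNEST(
    ARRAY<STRUCT<user_id STRING, amount FLOAT64>>
    [
      ('user_1', 3),
      ('user_2', 4),
      ('user_1', 5),
      ('user_4', 99999),
      ('user_5', 1E15),
      ('user_1', 3)
    ]
  )
),
# Mean and sample standard deviation over the whole group.
stats AS (
  SELECT
    AVG(amount) AS avg_amount,
    STDDEV(amount) AS std_amount
  FROM gold_earn
)
SELECT
  g.user_id,
  g.amount,
  (g.amount - s.avg_amount) / s.std_amount AS z_score
FROM gold_earn AS g
CROSS JOIN stats AS s
ORDER BY g.amount DESC
```

Because the 1E15 row dominates both AVG and STDDEV, the 99999 row ends up with practically the same z-score as the honest rows (3, 4 and 5), so no cut-off on the z-score can separate them.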

Does anyone have an idea how to find these outliers successfully?

My dataset looks something like this (note that cheaters don't show up this often in the real data, which has on the order of billions of rows):

user_id, currency, earn_or_spend, source_or_sink, amount
'user_1', 'gold', 'earn', 'quest', 3
'user_1', 'cash', 'earn', 'building_collect', 10000
'user_3', 'gold', 'spend', 'quest', 1
'user_2', 'gold', 'earn', 'quest', 4
'user_1', 'cash', 'earn', 'building_collect', 50000
'user_1', 'gold', 'earn', 'quest', 5
'user_4', 'gold', 'earn', 'quest', 99999 # cheater
'user_3', 'gold', 'spend', 'quest', 3
'user_5', 'gold', 'earn', 'quest', 1E15 # cheater
'user_3', 'cash', 'earn', 'level_up', 100000
'user_1', 'gold', 'earn', 'quest', 3
'user_1', 'cash', 'spend', 'build_building', 50000

It can be generated in Google BigQuery with this code:

WITH data as (
  SELECT 
    * 
  FROM UNNEST
  (
    ARRAY<STRUCT<user_id STRING, currency STRING, earn_or_spend STRING, source_or_sink STRING, amount FLOAT64>>
    [
      ('user_1', 'gold', 'earn', 'quest', 3),
      ('user_1', 'cash', 'earn', 'building_collect', 10000),
      ('user_3', 'gold', 'spend', 'quest', 1),
      ('user_2', 'gold', 'earn', 'quest', 4),
      ('user_1', 'cash', 'earn', 'building_collect', 50000),
      ('user_1', 'gold', 'earn', 'quest', 5),
      ('user_4', 'gold', 'earn', 'quest', 99999), # cheater
      ('user_3', 'gold', 'spend', 'quest', 3),
      ('user_5', 'gold', 'earn', 'quest', 1E15), # cheater
      ('user_3', 'cash', 'earn', 'level_up', 100000),
      ('user_1', 'gold', 'earn', 'quest', 3),
      ('user_1', 'cash', 'spend', 'build_building', 50000)
    ]
  )
)

SELECT * FROM data

Solution

I managed to do it. I finally used three outlier filters.

  • Filter extreme cheaters: per group of [level, currency source or sink, type (earn/spend), payer/non-payer, month of start (if the game changes over time, new players might earn or spend differently than earlier ones)], I calculated the median and the median absolute deviation (MAD) of the amount of currency earned/spent. Filter a user if: amount earned/spent > 100 * median + 100 * MAD
  • Filter users that spend more of a currency than they have earned.
  • Filter users that earn a currency by means of an in-app purchase (IAP) but didn't generate any real-money revenue in that level (i.e. they did an IAP but didn't pay for it).
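
The first two filters could be sketched in BigQuery roughly as follows, using the sample data from the question. This is an illustration under assumptions, not the exact query I ran: the sample has no level, payer or start-month columns, so the groups here are only (currency, earn_or_spend, source_or_sink), and the CTE names and the reason column are made up. The third filter is left out because it needs revenue data that is not part of the sample.

```sql
WITH data AS (
  SELECT *
  FROM UNNEST(
    ARRAY<STRUCT<user_id STRING, currency STRING, earn_or_spend STRING, source_or_sink STRING, amount FLOAT64>>
    [
      ('user_1', 'gold', 'earn', 'quest', 3),
      ('user_1', 'cash', 'earn', 'building_collect', 10000),
      ('user_3', 'gold', 'spend', 'quest', 1),
      ('user_2', 'gold', 'earn', 'quest', 4),
      ('user_1', 'cash', 'earn', 'building_collect', 50000),
      ('user_1', 'gold', 'earn', 'quest', 5),
      ('user_4', 'gold', 'earn', 'quest', 99999), # cheater
      ('user_3', 'gold', 'spend', 'quest', 3),
      ('user_5', 'gold', 'earn', 'quest', 1E15), # cheater
      ('user_3', 'cash', 'earn', 'level_up', 100000),
      ('user_1', 'gold', 'earn', 'quest', 3),
      ('user_1', 'cash', 'spend', 'build_building', 50000)
    ]
  )
),
# Median amount per group (assumed grouping, see the note above).
with_median AS (
  SELECT
    *,
    PERCENTILE_CONT(amount, 0.5) OVER (
      PARTITION BY currency, earn_or_spend, source_or_sink
    ) AS median_amount
  FROM data
),
# MAD = median of the absolute deviations from the group median.
with_mad AS (
  SELECT
    *,
    PERCENTILE_CONT(ABS(amount - median_amount), 0.5) OVER (
      PARTITION BY currency, earn_or_spend, source_or_sink
    ) AS mad_amount
  FROM with_median
),
# Filter 1: amount > 100 * median + 100 * MAD.
extreme_amounts AS (
  SELECT DISTINCT
    user_id,
    'amount above 100 * median + 100 * MAD' AS reason
  FROM with_mad
  WHERE amount > 100 * median_amount + 100 * mad_amount
),
# Filter 2: total spent exceeds total earned for some currency.
overspenders AS (
  SELECT
    user_id,
    'spent more than earned' AS reason
  FROM data
  GROUP BY user_id, currency
  HAVING SUM(IF(earn_or_spend = 'spend', amount, 0))
       > SUM(IF(earn_or_spend = 'earn', amount, 0))
)
SELECT * FROM extreme_amounts
UNION ALL
SELECT * FROM overspenders
```

On the sample rows, the first filter flags user_4 and user_5, because the median and MAD of the gold quest earn group stay small even with the 1E15 row present. The second filter flags user_3, but only because the 12-row sample contains no gold earn events for that user; it is only meaningful when the table holds each user's full history, including any starting balance. On the full table, the exact PERCENTILE_CONT window function may be expensive, and APPROX_QUANTILES with a GROUP BY is a possible cheaper substitute.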
License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange