Question

I have an event dataset in a factless fact table. It is a list of events:

timestamp -> event name -> node (in network)

There is always a sequence of events happening. How do I start predicting future events based on past ones, and discovering the list of nodes that will be affected, from past experience?

I am a programmer without machine learning knowledge. I have installed Spark and R, and have the dataset in an Oracle database. Is there a tutorial/algorithm that I can use to get started? I taught myself Scala/R but have no idea how to begin. My dataset is huge, i.e. more than 9 billion rows for 3 months.

Node            Eventtime       alarmname
192.168.1.112   6/14/2016 19:41 Main power supply has a fault alarm
192.168.1.113   6/14/2016 19:23 Association path broken
192.168.1.113   6/14/2016 19:23 NA
192.168.1.113   6/14/2016 19:23 Association broken
192.168.1.112   6/14/2016 19:23 Mains Failure
192.168.1.112   6/14/2016 19:23 Mains Failure

Additional Information:

I have 98 nodes. I would like to predict:

i. The number of nodes that raise an alarm or go down when a single node goes down, e.g. if node A has an alarm, the list of nodes that also had alarms in the same period over one month.

ii. The sequence of event occurrence, i.e. if one node has a mains failure, then the next event would be node down.

Was it helpful?

Solution

The problem you are facing is a time series problem. Your events are categorical, which is a specific case, so the most common techniques such as ARIMA and the Fourier transform are irrelevant.

Before getting into the analysis, try to find out whether the events on different nodes are independent. If they are independent, you can break them into per-node sequences and analyze each one. If they are not (e.g., "Main power supply has a fault alarm" on node x indicates the same event on node y), you should analyze the combined sequence. Sometimes, even when the sequences are dependent, you can gain from using the per-node sequences as extra data.
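One cheap way to probe this independence, sketched below on hypothetical rows (the `events` list and the 5-minute window are assumptions, not from your data): count how often events on *different* nodes fall within a short time window of each other. A high pair count relative to the number of events suggests the per-node sequences are not independent.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows: (node, event_time, alarm_name)
events = [
    ("192.168.1.112", datetime(2016, 6, 14, 19, 23), "Mains Failure"),
    ("192.168.1.113", datetime(2016, 6, 14, 19, 23), "Association path broken"),
    ("192.168.1.112", datetime(2016, 6, 14, 19, 41), "Main power supply has a fault alarm"),
]

def cross_node_cooccurrence(events, window=timedelta(minutes=5)):
    """Count pairs of events on *different* nodes that occur within
    `window` of each other. Events are sorted by time so the inner
    loop can stop as soon as it leaves the window."""
    events = sorted(events, key=lambda e: e[1])
    pairs = 0
    for i, (node_a, t_a, _) in enumerate(events):
        for node_b, t_b, _ in events[i + 1:]:
            if t_b - t_a > window:
                break  # all later events are even further away
            if node_b != node_a:
                pairs += 1
    return pairs

print(cross_node_cooccurrence(events))
```

On 9 billion rows you would do the same aggregation in Spark (e.g., a self-join on a bucketed timestamp), but the logic is the same.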

Your dataset is quite large, which means that computation will take time. Your data is probably noisy, so you will probably make some mistakes along the way. Therefore, I recommend advancing in small steps, from simple models to more complex ones.

Start with descriptive statistics, just to explore the data. How many events do you have? How common are they? What is the probability of the events you are trying to predict? Can you remove some events as meaningless (e.g., by using domain knowledge)?
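This exploration step can be as simple as a frequency table over the alarm column (the `alarms` list below is hypothetical sample data, not your schema):

```python
from collections import Counter

# Hypothetical alarm names pulled from the factless table
alarms = [
    "Mains Failure", "Mains Failure", "Association broken",
    "Association path broken", "NA", "Mains Failure",
]

counts = Counter(alarms)
total = sum(counts.values())

# Empirical probability of each alarm. Very rare alarms may be noise;
# very frequent ones may be heartbeats you can filter out with domain
# knowledge before modeling.
for alarm, n in counts.most_common():
    print(f"{alarm}: {n} ({n / total:.1%})")
```

At your scale you would run the equivalent `GROUP BY` / `count` in Oracle or Spark rather than in local memory, but the quantities to look at are the same.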

In case you have domain knowledge indicating that recent events are the important ones, I would try predicting based on the n last events. Start with n = 1 and grow slowly, since the number of combinations grows very fast, and the number of samples you will have for each combination will become small and might introduce errors.
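Predicting from the n last events amounts to an n-gram (order-n Markov) model over the alarm sequence. A minimal sketch, with a hypothetical per-node sequence:

```python
from collections import Counter, defaultdict

def fit_ngram(sequence, n=1):
    """Count, for each context of the n most recent events, which event
    follows it. With n=1 this is a first-order Markov chain."""
    model = defaultdict(Counter)
    for i in range(len(sequence) - n):
        context = tuple(sequence[i:i + n])
        model[context][sequence[i + n]] += 1
    return model

def predict_next(model, context):
    """Most likely next event given the last n events, or None if the
    context was never seen (sparsity grows quickly as n increases)."""
    counts = model.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical per-node alarm sequence
seq = ["Mains Failure", "Node Down", "Node Up",
       "Mains Failure", "Node Down", "Node Up",
       "Mains Failure"]
model = fit_ngram(seq, n=1)
print(predict_next(model, ["Mains Failure"]))  # -> Node Down
```

Growing n is just a parameter change here, which makes it easy to observe the sparsity problem described above: contexts seen only once start to dominate.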

In case the important events are not recent, try to condition on those past events instead.

In most cases such a simple model will get you a bit above the baseline, but not much more. Then you will need more complex models. I recommend association rules, which fit your case and have plenty of implementations.

You can advance further, but try these techniques first.

The techniques mentioned above will give you a model that predicts the probability that a node will go down, answering your question (ii). Running it on the sequences of all nodes will enable you to predict the number of nodes that will fail, answering question (i).

OTHER TIPS

There are many possible follow-up steps (depending on the results of step 1) for the objectives you mentioned. Here are some starting steps to take the analysis to the next level.

For the first objective, you can calculate a simple conditional probability for every node over some time window. This gives you a view of how each node affects the other nodes. Also, explore Bayesian networks on the data.

For the second objective, as recommended by Dan Levin, association rules are a good place to start. To simplify the process, you can start with two main events (perhaps mains failure and association path broken) from the available events. Fix the RHS of the association rules to those two main events; the LHS will be the events that occurred before the RHS events (within some time window). Now run association rule mining on the data, and you will be able to find some precursors for the two events considered. You can find an implementation of association rules in Spark in the following paper:

R-Apriori: An Efficient Apriori based Algorithm on Spark

Link: http://www.iith.ac.in/~mkaul/papers/pikm09-rathee.pdf
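The per-node conditional probability suggested for the first objective can be sketched like this (the `downs` rows, node names, and 1-hour window are all hypothetical placeholders):

```python
from datetime import datetime, timedelta

# Hypothetical (node, event_time) down-events
downs = [
    ("A", datetime(2016, 6, 14, 19, 0)),
    ("B", datetime(2016, 6, 14, 19, 10)),
    ("A", datetime(2016, 6, 15, 8, 0)),
    ("B", datetime(2016, 6, 15, 8, 5)),
    ("C", datetime(2016, 6, 16, 2, 0)),
]

def conditional_probability(downs, cause, effect, window=timedelta(hours=1)):
    """Estimate P(effect node has an event within `window` | cause node
    had an event) by counting how many of cause's events are followed
    by an effect event inside the window."""
    cause_times = [t for n, t in downs if n == cause]
    effect_times = [t for n, t in downs if n == effect]
    hits = sum(
        any(t < te <= t + window for te in effect_times) for t in cause_times
    )
    return hits / len(cause_times) if cause_times else 0.0

print(conditional_probability(downs, "A", "B"))
```

Computing this for all 98 × 97 node pairs gives you the "how each node affects the others" view; a pair with probability well above the effect node's base rate is a candidate dependency.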

Further, to use association rules for prediction, the following points will help:

  1. There must be a time lag between LHS and RHS, i.e. a time gap between the antecedent of the rule (one node has a mains failure) and its consequent (the next event is node down).

  2. A prediction rule must have relatively stable confidence with respect to the time frame determined by the application domain. That is, run association rule mining on the full 3 months of data and you will get some rules with some confidence. Now run it month by month and compare the confidence of the same rules; if the confidence is consistent, you can use those rules for prediction with confidence.
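The stability check in point 2 is easy to express once you have rule confidence. A minimal sketch, where the month buckets and transactions (event sets within a time window) are hypothetical:

```python
def confidence(transactions, lhs, rhs):
    """confidence(LHS -> RHS) = support(LHS and RHS) / support(LHS)."""
    lhs, both = set(lhs), set(lhs) | set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_lhs if n_lhs else 0.0

# Hypothetical transactions bucketed per month
months = {
    "June": [["Mains Failure", "Node Down"],
             ["Mains Failure", "Node Down"],
             ["Association broken"]],
    "July": [["Mains Failure", "Node Down"],
             ["Mains Failure"]],
}

# If the monthly confidences diverge like this, the rule is not stable
# enough to use for prediction.
for month, txns in months.items():
    c = confidence(txns, ["Mains Failure"], ["Node Down"])
    print(month, round(c, 2))
```

In practice you would take the rules mined by Spark (e.g., FP-Growth) over the full 3 months and re-score only those rules per month, rather than re-mining from scratch.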

For more details refer:

http://link.springer.com/chapter/10.1007%2F11548706_11#page-1

You can also use association rules for objective 1 by putting node 1 on the LHS and all events occurring after node 1's event on the RHS.

Hope this helps!

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange