Question

I am trying to put together a way to figure out whether incidents occurred based on log content. Typically, one log (or DB table) would contain a list of transactions structured as follows: {Timestamp} {TransactionID} {Message}

As an example of a problem I have run into, I tried to detect whether "incidents" occurred by running some very basic queries such as:

SELECT Timestamp, count(*)
FROM table
GROUP BY Timestamp
HAVING COUNT(*) > 5
ORDER BY Timestamp

This works OK but has very strong caveats:

  • In times of high activity it returns records even though nothing is wrong (and raising the threshold hides rows that should be returned).
  • Say I have 4 events at timestamp t and 2 events at timestamp (t+1): my query returns nothing, even though together they should be considered. Unfortunately, aggregating timestamps into time ranges brings me back to the previous point about periods of high activity.

Would someone have some insights on how to tackle the broader problem of incident detection within logs?


Solution

Would someone have some insights on how to tackle the broader problem of incident detection within logs?

There are two classical approaches to this problem.

In the first approach, you have a precise list of incidents you want to report. For each of these incidents, you pick a kind of “mathematical definition” and a formal test for the incident. Your dictionary of incidents will most likely evolve over time. Interestingly, you can put incident reports back into the event flow you are analysing, which gives you the ability to define incident patterns at a higher level.
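As a rough illustration of this first approach, here is a minimal Python sketch. Everything in it is hypothetical: the `Event` record, the two rules, the "FAILED" marker and the 10 s / 60 s windows are made up for the example, not taken from your logs. Each rule is a formal test over the history so far, and every incident report is appended to that same history so that a higher-level rule can match on reports.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: int        # seconds (hypothetical unit)
    transaction_id: str
    message: str


def burst_of_failures(history: List[Event]) -> bool:
    """Hypothetical incident: the latest event is a failure and it is
    at least the 3rd failure within the last 10 seconds."""
    last = history[-1]
    if "FAILED" not in last.message:
        return False
    recent = [e for e in history
              if last.timestamp - e.timestamp <= 10 and "FAILED" in e.message]
    return len(recent) >= 3


def repeated_bursts(history: List[Event]) -> bool:
    """Higher-level incident defined on incident reports: at least
    2 burst reports within the last 60 seconds."""
    last = history[-1]
    if last.message != "INCIDENT burst_of_failures":
        return False
    recent = [e for e in history
              if last.timestamp - e.timestamp <= 60 and e.message == last.message]
    return len(recent) >= 2


# The "dictionary of incidents": extend this list as definitions evolve.
RULES: List[Tuple[str, Callable[[List[Event]], bool]]] = [
    ("burst_of_failures", burst_of_failures),
    ("repeated_bursts", repeated_bursts),
]


def detect(events: List[Event]) -> List[Event]:
    """Run every rule after each event; feed reports back into the flow."""
    history: List[Event] = []
    reports: List[Event] = []
    for event in events:
        history.append(event)
        for name, rule in RULES:
            if rule(history):
                report = Event(event.timestamp, event.transaction_id,
                               f"INCIDENT {name}")
                history.append(report)   # reports become events themselves
                reports.append(report)
    return reports


SAMPLE = (
    [Event(t, f"tx{t}", "payment FAILED") for t in (0, 1, 2)]
    + [Event(5, "tx5", "payment OK")]
    + [Event(t, f"tx{t}", "payment FAILED") for t in (30, 31, 32)]
)

for report in detect(SAMPLE):
    print(report.timestamp, report.message)
```

The rules rescan the whole history on every event, which is fine for a sketch but would need a bounded window on a real log.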

In the second approach, you have a large database of “normal activity” and your statistical skills are in good shape. You then define numerical observables (like the time between two transactions, or the daily rate of some transaction, or whatever seems to make sense for your problem) and then apply univariate outlier detection methods.
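To make the second approach concrete, here is a minimal sketch that uses one possible observable (the gap between consecutive transactions) and one possible univariate method (the modified z-score based on the median and the median absolute deviation, with the commonly quoted 3.5 cutoff). Both choices are assumptions made for illustration; other observables and other tests may suit your data better, and the function names are invented for the example.

```python
from statistics import median
from typing import List, Tuple


def gaps(timestamps: List[float]) -> List[float]:
    """Observable: time between consecutive transactions."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def fit_baseline(normal_values: List[float]) -> Tuple[float, float]:
    """Summarise 'normal activity' by its median and its MAD."""
    med = median(normal_values)
    mad = median([abs(v - med) for v in normal_values])
    return med, mad


def modified_z(value: float, med: float, mad: float) -> float:
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    if mad == 0:
        # Degenerate baseline: any deviation at all is treated as extreme.
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad


def outliers(values: List[float], med: float, mad: float,
             threshold: float = 3.5) -> List[float]:
    """Values whose modified z-score exceeds the threshold in magnitude."""
    return [v for v in values if abs(modified_z(v, med, mad)) > threshold]


# Baseline learned from a period considered normal (made-up numbers).
normal_ts = [0, 10, 21, 30, 41, 50, 60]
med, mad = fit_baseline(gaps(normal_ts))

# New activity to screen: one very short gap and one very long gap.
new_ts = [100, 110, 111, 160]
print(outliers(gaps(new_ts), med, mad))
```

The median and MAD are used rather than the mean and standard deviation because they are barely affected by the few extreme values that may already be present in the baseline.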

The first approach gives you the opportunity to detect known incidents; the second gives you the chance to “see that something odd is happening”, but will require you to look at what is actually going on.

You will most likely want to use a mixture of both approaches, but you definitely need to analyse the importance and the likelihood of each incident type before devising an efficient strategy.

Other tips

Your query is currently trying to detect incidents by the heuristic "many log events happened in a short window of time". For some classes of "incidents", this might be a good heuristic, but it sounds like it isn't for you. I don't know what you mean by "incident" in this context, but I assume you are trying to detect trouble on a server or something similar.

A better approach is to put more information into the log. What that information should be will be specific to your problem domain. As an example, one way to add more information is to add an {event type} column to your data. Then you might have event types like "purchase successful" or "ERROR" or "transaction cancelled by user" or "search found no results", or whatever list of types defines the kind of events you are trying to notice. Then your queries can be structured to include that, and your heuristic can be improved or perhaps become a deterministic algorithm.
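As a sketch of what such a query could look like, the snippet below builds a small in-memory SQLite table and flags the minutes in which more than half of the events are of type "ERROR". The table name `log`, its columns, the one-minute bucket and the 0.5 ratio are all assumptions chosen for the example. Comparing a ratio rather than a raw count is one way to keep busy periods from triggering the test on volume alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log (ts INTEGER, transaction_id TEXT, "
    "event_type TEXT, message TEXT)"
)

# Made-up data: a busy but healthy minute, then a quiet minute full of errors.
rows = [(i, f"t{i}", "ERROR" if i == 0 else "purchase successful", "")
        for i in range(10)]
rows += [
    (60, "t10", "ERROR", ""),
    (61, "t11", "ERROR", ""),
    (62, "t12", "ERROR", ""),
    (63, "t13", "search found no results", ""),
]
conn.executemany("INSERT INTO log VALUES (?, ?, ?, ?)", rows)

QUERY = """
SELECT ts / 60 AS minute,
       SUM(CASE WHEN event_type = 'ERROR' THEN 1 ELSE 0 END) AS errors,
       COUNT(*) AS total
FROM log
GROUP BY ts / 60
HAVING SUM(CASE WHEN event_type = 'ERROR' THEN 1 ELSE 0 END) * 1.0
       / COUNT(*) > ?
ORDER BY minute
"""


def suspicious_minutes(connection, min_ratio=0.5):
    """Return (minute, errors, total) for minutes whose error ratio
    is strictly above min_ratio."""
    return connection.execute(QUERY, (min_ratio,)).fetchall()


print(suspicious_minutes(conn))
```

Here the first minute has ten events but only one error, so it is not reported, while the second minute has only four events and is reported because three of them are errors.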

In general, deciding what to log and how to log it is a function of what questions you want to ask the logs. It's about what you need out of them. So if you start from that perspective, and then design your logging, it will be a lot easier to answer those questions when you need to.

Licensed under: CC-BY-SA with attribution