Is my regex just wrong, or is there a bug in td-agent's format behaviour?

https://stackoverflow.com/questions/23368178

12-07-2023

Question

I am using Fluentd, Elasticsearch and Kibana to organize logs. Unfortunately, these logs do not follow any standard format such as Apache's, so I had to come up with the format regex myself. I used this site to verify that it works: http://fluentular.herokuapp.com/ .

The logs have roughly this format here:

DEBUG:  24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.

The format regex I am using is as follows:

format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/

According to that website, which is meant to test Fluentd's regex behaviour specifically, the output SHOULD be:

Record
Key     Value
pri     DEBUG
date    24.04.2014
subject     SingleActivityStrategy
msg     Start Activitiy 'barbecue' zu verabeiten.

Instead, I see this (bug?): pri is always shortened to DEBU. The same happens for ERROR, which becomes ERRO; only INFO stays INFO. I am not very experienced with regular expressions and I find it hard to believe that this is a bug. Still, it confuses me, and any help is greatly appreciated.

I'm not sure I can link the complete config file, because I don't personally own these log files and I want to avoid posting anything my boss would consider sensitive. If it is definitely needed, I will post it later, after asking him how much I can reveal.

In general, the logs always look roughly like this: first the priority, which is either DEBUG, ERROR or INFO; next the date and time; next what we call the subject, which is always written in [ ]; and finally just a message.

Here is a link to fluentular with the format I am using and a teststring that produces the right result in fluentular, but not in my config file:

Fluentular

Sorry I couldn't make it work like a regular link to just click on.

Another link to test out regex with my format and test string is this one:

http://rubular.com/r/dfXOkQYNXP

tl;dr version:

My td-agent format regex cuts off the last letter of the priority, although Fluentular says it shouldn't. My fault or a bug?

Solution

How the regex would look if you're trying to match the data specifically:

(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)

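In your format string, you used . and ... where the spaces and the colon should be. I'm not too sure why this works in Fluentular, but you should match the \: and each space between the values explicitly. Note also that [INFO], [DEBUG] and [ERROR] are character classes, each matching a single character from the listed set, not the literal words. Because ... must consume exactly three characters, the regex engine backtracks and gives the last letter of the level to ... whenever fewer than three separator characters precede the date, which would produce DEBU and ERRO.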
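
The truncation can be reproduced outside td-agent. The following is a minimal Ruby sketch (Fluentd's format regexes are Ruby regexes), assuming a hypothetical variant of the log line with only one space after the colon; the constant and variable names are made up for illustration. It suggests that the bracketed terms in the original format act as character classes, so the three dots can end up taking the last letter of the level.

```ruby
# The format regex from the question, unchanged.
ORIGINAL = /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/

# Sample line from the question: two spaces after the colon.
two_spaces = "DEBUG:  24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten."

# Hypothetical variant (an assumption, not taken from the real logs):
# the same line with only one space after the colon.
one_space = two_spaces.sub(":  ", ": ")

# With three separator characters (":", " ", " ") the "..." is satisfied
# and the level is captured whole.
puts ORIGINAL.match(two_spaces)[:pri]   # prints DEBUG

# With only two separator characters, "..." has to borrow the trailing "G",
# because [DEBUG]+ happily backtracks to four characters.
puts ORIGINAL.match(one_space)[:pri]    # prints DEBU
```

If the real log files pad INFO with an extra space but not DEBUG or ERROR, this would explain why only INFO survives intact; that padding is a guess, since the full logs are not shown.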

So you'd be looking at the following regular expression with the Fluentd fields (which are grouping names):

(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))

Meaning your td-agent.conf should look like:

<source> 
  type tail 
  path /var/log/foo/bar.log 
  pos_file /var/log/td-agent/foo-bar.log.pos 
  tag foo.bar 
  format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/ 
</source>
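
Before restarting td-agent, the format can be sanity-checked in plain Ruby. This is only a sketch: it runs the regex from the config above against the sample line from the question and prints the named captures. The names FIXED, line and fields are illustrative, and it does not reproduce how Fluentd itself handles the time field.

```ruby
# The corrected format regex from the td-agent.conf above.
FIXED = /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/

# Sample log line from the question.
line = "DEBUG:  24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten."

match = FIXED.match(line)
abort "line did not match the format" if match.nil?

# named_captures returns a Hash of group name => captured text.
fields = match.named_captures
fields.each { |key, value| puts "#{key}\t#{value}" }
```

Because the level is now matched as a literal word and the whitespace after the colon uses \s+, the pri field should stay complete whether one or two spaces follow the colon.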

I would also look into comparing Logstash and Fluentd. I like Logstash far more because you create Grok filters to match the type of data you want, which makes formatting your fields much easier since Grok provides an abstraction layer over raw regexes. Either way, you will essentially get the same data.

And I would watch out when you're using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like Regexr which gives immediate feedback and you can set global and multiline matching as well.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow