Question

I'm pushing my logs to a local Splunk installation. Recently I noticed the following error repeating frequently (about once a minute):

Error L10 (output buffer overflow): 7150 messages dropped since 2013-06-26T19:19:52+00:00.134 <13>1 2013-07-08T14:59:47.162084+00:00 host app web.1 - [\x1B[37minfo\x1B[0m] application - Perf - it took 31 milliseconds to fetch row IDs ...

The errors repeat quite a lot, and the documentation says these errors happen when your application produces a large volume of logs.

The thing is, I barely produce 20-30 log lines per second, which isn't really considered a lot. I tested with other drains (I added the built-in Papertrail plugin), and these errors do not happen there, so they are specific to the outgoing Splunk drain.

I thought maybe the Splunk machine was under load and thus not accepting logs fast enough, but its CPU is idle and it has plenty of disk and memory.

Also, I believe the app (Play 2 app) is auto-flushing logs to console all the time, so there is no big buildup of unflushed logs followed by a release.

What can cause a slow drain speed for the outgoing Splunk drain, and how should I debug it?

Solution

After a long ping-pong with the Heroku team, we found the answer:

I had used the URL prefix http:// when configuring the log drain instead of syslog://. When I changed the URL to syslog://, the error went away and logs started flowing correctly into Splunk.
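With the Heroku CLI, the fix amounts to replacing the drain URL's scheme. A hedged sketch (the app name, hostname, and port below are placeholders, not values from the original question):

```shell
# List current drains to find the misconfigured one.
heroku drains --app my-app

# Remove the HTTP drain and re-add it with the syslog:// scheme.
# splunk.example.com:514 is a placeholder for your Splunk host/port.
heroku drains:remove http://splunk.example.com:514 --app my-app
heroku drains:add syslog://splunk.example.com:514 --app my-app
```

After the change, watch `heroku logs --tail` for a while to confirm the L10 errors stop recurring.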

OTHER TIPS

My point of view is that just because the errors went away does not mean you solved the problem. HTTP gives a synchronous response, so if you are hitting a threshold, whether a capacity limitation or a business-agreement limit, the HTTP response code will be the indication. With Sumo Logic, if you exceed your burst rate limits, we return a 429 response code. Heroku Logplex is not tuned for negative response codes and will drop the data. With a syslog endpoint you may be losing data as well, except that syslog has no response channel, so its only option is to drop the data silently. With Sumo Logic you will see notifications in the Audit Log indicating that throttling is being applied; when this happens you should contact support or your account team to either adjust your limits or upgrade your plan.
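The protocol difference described above can be illustrated with a minimal sketch (the sender, listener, and message are hypothetical, not the actual Heroku or Splunk implementation): a syslog-style sender over UDP simply fires the datagram and gets nothing back, whereas an HTTP drain would receive a status code such as 429 that signals back pressure.

```python
import socket

# Minimal syslog-style sender over UDP: there is no response channel,
# so a drain that cannot keep up can only drop messages silently.
def send_syslog(host, port, message, facility=1, severity=6):
    pri = facility * 8 + severity  # RFC 3164 priority value
    payload = f"<{pri}>{message}".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (host, port))  # fire-and-forget, no reply

# Local demo: a throwaway UDP listener standing in for a syslog drain.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
port = recv.getsockname()[1]

send_syslog("127.0.0.1", port, "app web.1 - hello from the drain")
data, _ = recv.recvfrom(1024)
recv.close()
print(data.decode())  # <14>app web.1 - hello from the drain
```

The sender returns immediately regardless of whether the listener is overloaded or even running, which is exactly why a syslog endpoint's only failure mode is silent data loss.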

For me the problem was that the app was offline due to an error, and somehow the LogDNA add-on did not like that. Simply fixing the bug, redeploying, and restarting the dyno solved the problem in my case.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow