Question

I have a web service with a load balancer that maps requests to multiple machines. Each of these requests ends up making an HTTP call to an external API, and for that reason I would like to rate limit the number of requests I send to the external API.

My current design:

  • Service has a queue in memory that stores all received requests
  • I rate limit how often we can grab a request from the queue and process it.

This doesn't work when I'm using multiple machines, because each machine has its own queue and rate limiter. For example: when I set my rate limiter to 10,000 requests/day, and I use 10 machines, I will end up processing 100,000 requests/day at full load because each machine processes 10,000 requests/day. I would like to rate limit so that only 10,000 requests get processed/day, while still load balancing those 10,000 requests.

I'm using Java and MySQL.


Solution 2

The two things you stated were:

1)"I would like to rate limit so that only 10,000 requests get processed/day"
2)"while still load balancing those 10,000 requests."

First off, it seems like you are using a divide-and-conquer approach where each request from your end user gets mapped to one of the n machines. So, to ensure that only 10,000 requests get processed within the given time span, there are two options:

1) Implement a combiner that routes the jobs from all n machines to another endpoint, which then makes the calls to the external API. This endpoint can keep a count of the jobs being processed, and if the count is over your threshold, reject the job.

2) Another approach is to store the number of jobs you've processed for the day as a counter in your database. It is then common practice to check that counter against your threshold on the initial request for a job (before you even pass it off to one of your machines). If the threshold has been reached, reject the job at the start. Coupled with an appropriate message, this has the advantage of a better experience for the end user. A minimal sketch of this counter check follows below.
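For illustration, here is a minimal sketch of that database counter in Java/JDBC, assuming a MySQL table `daily_request_count(day, count)`; the table, column, and class names are hypothetical, not from the answer:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.time.LocalDate;

public class DailyQuotaGate {
    private static final int DAILY_LIMIT = 10_000;

    /**
     * Atomically claims one slot of today's quota.
     * Returns true if the request may proceed, false if the limit is reached.
     */
    public boolean tryAcquire(Connection conn) throws SQLException {
        String today = LocalDate.now().toString();

        // Ensure today's row exists (no-op if it already does).
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT IGNORE INTO daily_request_count (`day`, `count`) VALUES (?, 0)")) {
            insert.setString(1, today);
            insert.executeUpdate();
        }

        // Increment only while under the limit; the conditional UPDATE is a single
        // atomic statement, so concurrent machines cannot push the counter past it.
        try (PreparedStatement update = conn.prepareStatement(
                "UPDATE daily_request_count SET `count` = `count` + 1 " +
                "WHERE `day` = ? AND `count` < ?")) {
            update.setString(1, today);
            update.setInt(2, DAILY_LIMIT);
            return update.executeUpdate() == 1; // 1 row changed => slot claimed
        }
    }
}
```

Because all machines share the same MySQL instance, the conditional UPDATE enforces the 10,000/day cap across the whole cluster without any coordination between nodes.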

To ensure that these 10,000 requests are still load balanced so that no single CPU processes more jobs than any other, you should use a simple round-robin approach to distribute the jobs over your n CPUs. With round robin, as opposed to a binning/categorization approach, you ensure that job requests are distributed as uniformly as possible over your n CPUs. A downside of round robin is that, depending on the type of job you're processing, you might replicate a lot of data as you start to scale up. If this is a concern for you, think about implementing a form of locality-sensitive hashing (LSH). While a good hash function distributes the data as uniformly as possible, LSH exposes you to having one CPU process more jobs than the others if a skew in the attribute you choose to hash against has a high probability of occurring. As always, there are tradeoffs associated with both, so you'll know best for your use case.
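As a concrete illustration of that round-robin distribution, a minimal dispatcher sketch in Java (class and method names are mine, not from the answer):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Cycles through workers so jobs are spread as evenly as possible. */
public class RoundRobinDispatcher<W> {
    private final List<W> workers;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinDispatcher(List<W> workers) {
        this.workers = List.copyOf(workers);
    }

    /** Returns the worker that should handle the next job. */
    public W nextWorker() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int i = Math.floorMod(next.getAndIncrement(), workers.size());
        return workers.get(i);
    }
}
```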

Other tips

  1. Use Memcached or Redis to keep an API request counter per client, and check on every request whether the rate limit has been exceeded (see the sketch after this list).
  2. If you think checking on every request is too expensive, you can use Storm to process the request log and calculate the request counters asynchronously.
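A rough sketch of tip 1 in Java, assuming the Jedis client for Redis (the key scheme, limit, and class name are illustrative, not from the answer):

```java
import redis.clients.jedis.Jedis;

/** Per-client daily counter kept in Redis, shared by all machines. */
public class RedisDailyCounter {
    private static final int DAILY_LIMIT = 10_000;
    private static final int ONE_DAY_SECONDS = 24 * 60 * 60;

    private final Jedis jedis;

    public RedisDailyCounter(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Returns true if the client is still within its daily limit. */
    public boolean allowRequest(String clientId) {
        String key = "rate:" + clientId;
        // INCR is atomic across all machines sharing this Redis instance.
        long count = jedis.incr(key);
        if (count == 1) {
            // First request of the window: start the 24h expiry clock.
            jedis.expire(key, ONE_DAY_SECONDS);
        }
        return count <= DAILY_LIMIT;
    }
}
```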

Why not implement a simple counter in your database and make the API client implement the throttling?

User Agent -> LB -> Your Service -> Q -> Your Q consumer(s) -> API Client -> External API

The API client checks the number (for today) and you can implement whatever rate limiting algorithm you like. E.g. if the number is > 10k, the client could simply blow up, have the exception put the message back on the queue, and keep retrying until today becomes tomorrow, at which point all the queued-up requests can get processed.
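A minimal sketch of such a queue consumer, with `QuotaGate` and `ApiClient` as hypothetical stand-ins for the counter check and the external API call:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class ThrottledConsumer implements Runnable {
    /** Illustrative collaborators, named for this sketch only. */
    public interface QuotaGate { boolean tryAcquire(); }
    public interface ApiClient { void send(String request); }

    private final BlockingQueue<String> queue;
    private final QuotaGate gate;
    private final ApiClient api;

    public ThrottledConsumer(BlockingQueue<String> queue, QuotaGate gate, ApiClient api) {
        this.queue = queue;
        this.gate = gate;
        this.api = api;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                String request = queue.take();
                if (gate.tryAcquire()) {
                    api.send(request);
                } else {
                    // Over today's limit: requeue and back off until the day rolls over.
                    queue.put(request);
                    TimeUnit.MINUTES.sleep(5);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit cleanly on shutdown
            }
        }
    }
}
```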

Alternatively you could implement a tiered throttling system, e.g. full speed up to 8k, then 1 message every 5 seconds per node until you hit the limit, at which point you can send 503 errors back to the User Agent.
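A sketch of how those tiers might look in code; the 8k/10k thresholds come from the text above, everything else is illustrative:

```java
import java.util.concurrent.TimeUnit;

/** Maps today's request count to a per-node action: full speed below 8k,
 *  one message per 5 seconds up to 10k, then reject with 503. */
public class TieredThrottle {
    public enum Action { SEND_NOW, SEND_DELAYED, REJECT_503 }

    public Action decide(long countToday) {
        if (countToday < 8_000) return Action.SEND_NOW;
        if (countToday < 10_000) return Action.SEND_DELAYED;
        return Action.REJECT_503; // tell the User Agent to back off
    }

    public void applyDelay(Action action) throws InterruptedException {
        if (action == Action.SEND_DELAYED) {
            TimeUnit.SECONDS.sleep(5); // per-node pacing in the 8k-10k band
        }
    }
}
```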

Otherwise you could go the complex route and implement a distributed queue (e.g. an AMQP server). However, this may not solve the issue entirely, since your only control mechanism would be throttling so that you never process faster than some rate below the max limit per day. E.g. your max limit is 10k/day, so you never go faster than one message every ~8.6 seconds (86,400 seconds in a day ÷ 10,000 messages ≈ 8.64 s/message).
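A sketch of that pacing idea with a scheduled executor; multiplying the interval by the node count is my assumption about how you would split the daily budget across n consumers:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Paces consumption so the whole cluster never exceeds the daily budget.
 *  With 10,000 requests/day, one message every 8,640 ms hits the limit exactly,
 *  so each of nodeCount nodes polls every nodeCount * 8,640 ms. */
public class PacedConsumer {
    public static void start(BlockingQueue<Runnable> queue, int nodeCount) {
        long intervalMillis = (TimeUnit.DAYS.toMillis(1) / 10_000) * nodeCount;
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            Runnable job = queue.poll(); // non-blocking: skip the tick if the queue is empty
            if (job != null) {
                job.run(); // the HTTP call to the external API
            }
        }, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }
}
```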

If you're not averse to using a library/service, https://github.com/jdwyah/ratelimit-java is an easy way to get distributed rate limits.

If performance is of utmost concern, you can acquire more than one token per request so that you don't need to make an API request to the limiter 100k times. See https://www.ratelim.it/documentation/batches for details.
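Independent of any particular library, the batching idea might look like this sketch; `CentralLimiter` is a hypothetical stand-in for whatever distributed counter you use (DB, Redis, or a hosted service), and this is not the ratelimit-java API:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Claims permits from the shared limiter in chunks, so most requests
 *  are approved locally without a network round trip. */
public class BatchedLimiter {
    public interface CentralLimiter {
        /** Tries to claim n permits at once; returns how many were granted. */
        long acquireBatch(long n);
    }

    private static final long BATCH_SIZE = 100;

    private final CentralLimiter central;
    private final AtomicLong localPermits = new AtomicLong();

    public BatchedLimiter(CentralLimiter central) {
        this.central = central;
    }

    public boolean tryAcquire() {
        while (true) {
            long available = localPermits.get();
            if (available > 0) {
                if (localPermits.compareAndSet(available, available - 1)) {
                    return true; // served from the local batch, no remote call
                }
                continue; // lost the race with another thread, retry
            }
            long granted = central.acquireBatch(BATCH_SIZE); // one remote call per batch
            if (granted <= 0) {
                return false; // shared limit exhausted
            }
            localPermits.addAndGet(granted - 1); // keep one permit for this request
            return true;
        }
    }
}
```

The tradeoff is granularity: permits claimed in a batch but never used are still counted against the shared limit, so smaller batches track the true rate more closely at the cost of more remote calls.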

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow