Question

I am fairly new to Docker and Kubernetes and I have a question I could not figure out myself. I am working on an application that does string matching on data extracted from multiple sources, but the problem is that it's painfully slow. The bottleneck is two nested for loops: for each row of data frame X, the code scans all rows of data frame Y and performs a series of checks inside an if-elif construct to determine whether the match is acceptable. The result is also delivered as a data frame.

It's written in Python and mainly uses Pandas data frames. As far as I know, it could only be sped up by a faster processor, which I currently do not have access to. I tried parallel processing, but the overhead between the loops was too big and it resulted in even longer execution times (or my implementation was not good).

My question is: could I speed it up if I containerized it with Docker and deployed it to a Kubernetes cluster with 2-3 nodes on my LAN?

Keep in mind that it is a linear application: I did not write any code to make it ready for parallel processing, and we are talking about a single process, not multiple processes initiated by multiple users. In this case, would Kubernetes know to distribute the workload (maybe the processing of the various iterations over the rows of data frame X?) across the nodes, and then put it all together to deliver the answer?

Thank you!


Solution

The Answer to Your Stated Question

tl;dr No.

Kubernetes does not inherently support parallelization of code executed within a container runtime.

Kubernetes' core abstraction, the pod, enables you to spin up multiple copies of the same application on different nodes, but it does not inherently coordinate those copies to serve a single request. Kubernetes does let you create a Service, which load-balances requests across the pods; however, no Kubernetes abstraction lets you send a request to more than one pod at a time, or "partition" a request so that pieces of it are sent to all pods at once.

That being said, Kubernetes does support coordination across pods by means of StatefulSets and operators. Neither is particularly easy to work with, and both were designed with managing distributed databases in mind, which require long-lived persistence and a notion of leader election. Neither feature solves what you're trying to do.

The Answer to Your Actual Problem

In my opinion, you are trying to solve your issue at the wrong layer of abstraction. Kubernetes and Docker are designed to simplify deploying your application and keeping it reproducible across environments. They don't solve your fundamental problem, which is raw performance.

Here are some obvious points I'm seeing:

  1. You're doing a brute-force search across all of Y for every term found in X, but what you're trying to do is essentially what's known as full-text search. Why not take a leaf from the people who solved full-text searching and generate an inverted index?

    In other words, you would preprocess data frame Y once and generate a map Z from each word in Y to its locations in Y. Then you just check whether your search term from X exists in Z; if it does, great, you've found its locations in Y in O(1) time (assuming preprocessing Y to form Z is cheaper than searching over all of Y for every term in X, which it should be, since you only do it once). See the first sketch after this list for what that looks like in pandas.

  2. If that approach doesn't yield success, you can still "buy" your way out of the problem by using a MapReduce implementation such as Hadoop or Spark (the latter of which I strongly recommend on account of native Python bindings) and putting your 2-3 nodes to good use.

    Unlike Kubernetes, this actually does solve the problem of partitioning Y across multiple nodes, having each node check whether a term from X lives in its partition, and then combining the results of each node's search (see the Spark sketch after this list).

  3. If that still doesn't yield success, it may be time to ditch pandas altogether and use a database with fast full-text search enabled. The idea is to join all your multiple sources (what used to be Y) together, store the result in a production-ready database, and then query that database for all instances of your search term (see the last sketch after this list).
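
For point 1, here is a minimal sketch of the inverted-index idea in plain Python/pandas. The column names ("text" in Y, "term" in X) and the toy data are made up for illustration; you would swap in your real data frames and tokenization.

```python
from collections import defaultdict

import pandas as pd

# Toy stand-ins for your data: assume Y has a "text" column and X a "term" column.
Y = pd.DataFrame({"text": ["alpha beta", "beta gamma", "delta"]})
X = pd.DataFrame({"term": ["beta", "delta", "epsilon"]})

# Preprocess Y once: map each word to the set of row indices where it appears.
Z = defaultdict(set)
for row_idx, text in Y["text"].items():
    for word in text.lower().split():
        Z[word].add(row_idx)

# Each term from X is now a dictionary lookup instead of a scan over all of Y.
matches = {term: sorted(Z.get(term.lower(), set())) for term in X["term"]}
print(matches)  # {'beta': [0, 1], 'delta': [2], 'epsilon': []}
```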
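
For point 2, a rough PySpark sketch of the same matching expressed as a join. The file names, column names, and the simple substring predicate are assumptions standing in for your if-elif checks, which would go into the join condition (or a UDF).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("string-matching").getOrCreate()

# Hypothetical inputs: Y has a "text" column, X has a "term" column.
y_df = spark.read.csv("y.csv", header=True)
x_df = spark.read.csv("x.csv", header=True)

# Spark partitions Y across the worker nodes, evaluates the join predicate on
# each partition in parallel, and combines the per-partition matches for you.
matches = y_df.join(x_df, F.col("text").contains(F.col("term")), how="inner")

matches.write.csv("matches", header=True, mode="overwrite")
```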
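
For point 3, this is roughly what querying a full-text index looks like, using SQLite's FTS5 module as a stand-in for a production database (this assumes your SQLite build ships with FTS5, which most Python distributions do); the table, column names, and rows are made up.

```python
import sqlite3

# Build a full-text index over what used to be Y (all sources joined together).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE y_docs USING fts5(source, text)")
conn.executemany(
    "INSERT INTO y_docs (source, text) VALUES (?, ?)",
    [("crm", "alpha beta"), ("erp", "beta gamma"), ("web", "delta")],
)

# Each term from X becomes an indexed MATCH query instead of a scan over Y.
for term in ["beta", "delta", "epsilon"]:
    rows = conn.execute(
        "SELECT source, text FROM y_docs WHERE y_docs MATCH ?", (term,)
    ).fetchall()
    print(term, rows)
```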

These are all suggestions and your mileage may vary, but based on your description, this looks like a classic XY problem, and I hope this gives you some context for why Kubernetes is the wrong tool for what you're trying to solve here.

Licensed under: CC-BY-SA with attribution