Question

Well, I think it is hard. Maybe you'll find it easy.

I have two files: BUYINGORDERS, PRODUCTMASTERLIST

BUYINGORDERS (ProductID,ProductDescription) goes like this:

1;fresh coke bottle 1 lt
2;cheese CheesyBrand yellow 2 kg
3;little newborn puppies 10 kg

PRODUCTMASTERLIST (ProductDescription,Price) goes like this:

CheesyBrand yellow cheap cheese 2 kg;3.40    
bottle of very fresh coke of 1 lt;2.90

I need to find the descriptions in BUYINGORDERS which are present in PRODUCTMASTERLIST. The thing is, as you can see, the lines are not strictly the same: the condition for matching is that every word in a BUYINGORDERS entry's ProductDescription should appear, IN ANY ORDER, in a PRODUCTMASTERLIST entry's ProductDescription. The entries in PRODUCTMASTERLIST may even have more words.

So, despite being slightly different, line 1 from BUYINGORDERS matches line 2 from PRODUCTMASTERLIST, since words 'fresh','coke', 'bottle', '1' and 'lt' are among 'bottle of very fresh coke of 1 lt'.

Now, I am not asking you to do my homework, of course (I wouldn't complain, though :) ), but I would very much appreciate, at the very least, a possible approach to the matter.

Solution

  1. Extract the whole field you care about
  2. Sort the values in each field
  3. Stick a ".*" between each value in the shorter string
  4. Look for the modified shorter sorted string in the longer sorted string using whatever supports regexps

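Because both strings are sorted the same way, the shared words appear in the same relative order, so you can rely upon "coke.*fresh" matching both "coke fresh" and "coke extra fresh". Note that a bare ".*" also matches inside longer words ("1" is found in "10"), so anchor each word on spaces or word boundaries if that matters.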
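The four steps can be sketched in Bash roughly as follows. This is only one possible reading of the idea, not a tested production script: the function names sorted_words and desc_matches are made up for illustration, the separator is tightened from ".*" to " (.* )?" so that only whole words match, duplicate words are collapsed with sort -u, and the words are assumed to contain no regex metacharacters.

```bash
#!/usr/bin/env bash

# Print the distinct words of a description, sorted, each followed by a space.
# LC_ALL=C pins the collation so both descriptions sort the same way.
sorted_words() {
    local -a words
    read -ra words <<< "$1"
    printf '%s\n' "${words[@]}" | LC_ALL=C sort -u | tr '\n' ' '
}

# Succeed if every word of $1 occurs as a whole word in $2, in any order.
# Assumption: words contain no regex metacharacters such as . * ( ) [ ].
desc_matches() {
    local needle haystack pattern
    needle=$(sorted_words "$1")          # e.g. "1 bottle coke fresh lt "
    needle=${needle% }                   # drop the trailing space
    haystack=" $(sorted_words "$2")"     # pad so every word is space-delimited
    # "a b c" becomes " a (.* )?b (.* )?c "
    pattern=" ${needle// / (.* )?} "
    [[ $haystack =~ $pattern ]]
}
```

With the sample data, desc_matches "fresh coke bottle 1 lt" "bottle of very fresh coke of 1 lt" succeeds, while the puppies line matches neither master entry.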

Other suggestions

OK - without giving you the complete answer, here's how I'd tackle it:

  1. Figure out which file is likely to be smaller - you'll probably need to read one file into an array, and then cycle round each line of the other file doing the comparisons
  2. Use IFS to split the line at the semi-colon
  3. Turn the descriptions into arrays of words (e.g. bWords=( ${bDesc} ))
  4. Now you can count the number of words in the BUYINGORDERS description
  5. Loop over each word in the BUYINGORDERS description and, in a nested loop, over each word in the PRODUCTMASTERLIST description (e.g. for bWord in "${bWords[@]}"; do for pWord in "${pWords[@]}"; do ...)
  6. Every time you find that bWord == pWord, increment a counter. If the counter reaches the number of words in bWords, you've met your condition
  7. Take special care for repeated words in either description - you don't want to count them twice and produce false positives (e.g. use continue when a match is found - and I learnt recently you can use continue 2 to move to the next iteration of an outer loop from an inner loop. Which is nice.)
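
Put together, the steps above might look something like the sketch below. The function names contains_all_words and match_orders, the output format, and the choice to hold the master list in memory are assumptions made for illustration; read -ra is used instead of an unquoted bWords=( ${bDesc} ) to avoid glob expansion.

```bash
#!/usr/bin/env bash

# Succeed if every word of the first description occurs in the second.
contains_all_words() {
    local -a bWords pWords
    local bWord pWord found=0
    read -ra bWords <<< "$1"
    read -ra pWords <<< "$2"
    for bWord in "${bWords[@]}"; do
        for pWord in "${pWords[@]}"; do
            if [[ $bWord == "$pWord" ]]; then
                found=$((found + 1))
                continue 2   # count this order word once, then take the next one
            fi
        done
    done
    # An empty order description is treated as a non-match.
    (( ${#bWords[@]} > 0 && found == ${#bWords[@]} ))
}

# Print each order line whose description is covered by some master entry.
match_orders() {
    local ordersFile=$1 masterFile=$2
    local -a pDescs=()
    local pDesc price id bDesc
    # Keep one file in memory (here the master list, assumed to be the smaller).
    while IFS=';' read -r pDesc price || [[ -n $pDesc ]]; do
        pDescs+=("$pDesc")
    done < "$masterFile"
    # Split each order line at the semicolon and compare against every entry.
    while IFS=';' read -r id bDesc || [[ -n $id ]]; do
        for pDesc in "${pDescs[@]}"; do
            if contains_all_words "$bDesc" "$pDesc"; then
                printf '%s;%s => %s\n' "$id" "$bDesc" "$pDesc"
                break
            fi
        done
    done < "$ordersFile"
}
```

Calling match_orders BUYINGORDERS PRODUCTMASTERLIST on the sample files should report orders 1 and 2 and skip the puppies.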

This is a brute-force search, so not very efficient. If the files are large, it could get slow. The alternative would be to hash each description using its words - if you were clever about how you generate the hash, you might even be able to use a bitwise AND operation to see if one description was 'contained' within the other. But I'm not sure Bash is really up to that :)
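
For what it's worth, Bash's 64-bit integer arithmetic is enough for a rough version of the hashing idea. The sketch below is an assumption-laden illustration, not a recommendation: desc_mask and may_contain are hypothetical names, the polynomial hash is arbitrary, and words are assumed to be ASCII. Each word sets one of 63 bits, so different words can collide; the AND test can only rule pairs out, and any pair that passes still needs a word-by-word check.

```bash
#!/usr/bin/env bash

# Fold a description into a 63-bit signature: each word sets one bit,
# chosen by a simple polynomial hash of its characters (ASCII assumed).
desc_mask() {
    local -a words
    local word h i code mask=0
    read -ra words <<< "$1"
    for word in "${words[@]}"; do
        h=0
        for (( i = 0; i < ${#word}; i++ )); do
            printf -v code '%d' "'${word:i:1}"   # character code of the i-th character
            h=$(( (h * 31 + code) % 63 ))
        done
        mask=$(( mask | (1 << h) ))
    done
    echo "$mask"
}

# Necessary but not sufficient: every bit set in the order signature ($1)
# must also be set in the master signature ($2).
may_contain() {
    (( ($1 & $2) == $1 ))
}
```

The signatures would be computed once per line; only pairs for which may_contain succeeds need the slower exact comparison.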

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow