Question

I have a machine learning problem. I am given a long list of domains and I have to figure out which are ecommerce websites and which are personal websites. The difficulty is that I do not have any training data to work with. I have come up with a couple of ideas:

  1. Go through a couple hundred of these websites manually, label each as business or personal, and build a training set this way (long and boring!).

  2. Crawl these websites and search for some keywords, e.g. "Buy Now", "Price", "Credit Card", etc.
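Idea 2 could be sketched roughly as follows. This is only a minimal illustration, assuming the page text has already been downloaded and stripped of HTML; the names `ECOMMERCE_KEYWORDS`, `keyword_score`, `label_page`, the extra keywords "add to cart" and "checkout", and the 0.4 threshold are all hypothetical choices, not anything given in the problem.

```python
import re

# Seed keywords: the first three come from the question,
# the last two are assumed additions for illustration.
ECOMMERCE_KEYWORDS = ["buy now", "price", "credit card", "add to cart", "checkout"]


def keyword_score(page_text, keywords=ECOMMERCE_KEYWORDS):
    """Return the fraction of keywords found in the page text."""
    # Lowercase and collapse whitespace so "Buy   Now" still matches "buy now".
    text = re.sub(r"\s+", " ", page_text.lower())
    hits = sum(1 for kw in keywords if kw in text)
    return hits / len(keywords)


def label_page(page_text, threshold=0.4):
    """Label a page 'ecommerce' if enough keywords appear, else 'personal'."""
    # The threshold is an arbitrary starting point and would need tuning.
    return "ecommerce" if keyword_score(page_text) >= threshold else "personal"
```

Labels produced this way are noisy, but they could also serve as a cheap first pass before the manual labelling in idea 1.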

Does anybody have any other approaches?

Thanks


Solution

You could adaptively modify your keyword set: as you crawl, any word that correlates highly with the existing keywords can be added to the list.

Peter

P.S. I would add this as a comment, but I don't have enough reputation points...
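One round of this adaptive expansion might look like the sketch below. It is an assumption-laden illustration rather than a definitive method: it treats keywords as single words, takes "correlates highly" to mean the phi coefficient (Pearson correlation between two yes/no indicators: "page contains the word" and "page contains any current keyword"), and the names `tokenize`, `expand_keywords`, `min_corr`, and `min_pages` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase alphabetic words on a page."""
    return set(re.findall(r"[a-z]+", text.lower()))


def expand_keywords(pages, keywords, min_corr=0.6, min_pages=2):
    """Run one expansion round over a batch of crawled page texts."""
    keywords = set(keywords)
    docs = [tokenize(p) for p in pages]
    # A page is a "hit" if it contains at least one current keyword.
    hits = [bool(d & keywords) for d in docs]
    n, n_hit = len(docs), sum(hits)

    word_total = Counter()     # pages containing the word
    word_with_hit = Counter()  # pages containing the word and a keyword
    for doc, hit in zip(docs, hits):
        for word in doc - keywords:
            word_total[word] += 1
            if hit:
                word_with_hit[word] += 1

    added = set()
    for word, n_w in word_total.items():
        if n_w < min_pages:
            continue  # too rare to trust
        n11 = word_with_hit[word]  # word and keyword
        n10 = n_w - n11            # word, no keyword
        n01 = n_hit - n11          # keyword, no word
        n00 = n - n_w - n01        # neither
        denom = math.sqrt(n_w * (n - n_w) * n_hit * (n - n_hit))
        if denom == 0:
            continue  # word (or keyword hit) occurs on every page or none
        phi = (n11 * n00 - n10 * n01) / denom
        if phi >= min_corr:
            added.add(word)
    return keywords | added
```

Calling `expand_keywords` again on each new batch of crawled pages would let the list grow as the crawl proceeds. Words that appear on every page, such as common stop words, are skipped because their correlation is undefined. Multi-word phrases like "Buy Now" would need n-gram tokenization, which this sketch leaves out.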

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow