What are best practices for collecting, maintaining and ensuring accuracy of a huge data set?

StackOverflow https://stackoverflow.com/questions/4505502

  •  12-10-2019

Question

I am posing this question looking for practical advice on how to design a system.

Sites like amazon.com and pandora have and maintain huge data sets to run their core business. For example, amazon (and every other major e-commerce site) has millions of products for sale, images of those products, pricing, specifications, etc. etc. etc.

Ignoring the data coming in from 3rd-party sellers and the user-generated content, all that "stuff" had to come from somewhere and is maintained by someone. It's also incredibly detailed and accurate. How? How do they do it? Is there just an army of data-entry clerks, or have they devised systems to handle the grunt work?

My company is in a similar situation. We maintain a huge (10-of-millions of records) catalog of automotive parts and the cars they fit. We've been at it for a while now and have come up with a number of programs and processes to keep our catalog growing and accurate; however, it seems like to grow the catalog to x items we need to grow the team to y.

I need to figure out some ways to increase the efficiency of the data team, and hopefully I can learn from the work of others. Any suggestions are appreciated; even better would be links to content I could spend some serious time reading.


Solution

Use visitors.

  1. Even if you have one person per item, there will be wrong records, and customers will find them. So let visitors mark items as "inappropriate" and leave a short comment. But remember they are not your employees; don't ask too much of them. Look at Facebook's "like" button: it is easy to use and demands very little effort from the user. Good performance for the price. If Facebook had a mandatory "why do you like it?" field, no one would use the feature.

  2. Visitors also help you in an implicit way: they visit item pages and use search (both your internal search engine and external ones like Google). You can mine that activity, for example by ranking items by visit count, and then concentrate human effort on the top of the list while spending less on the "long tail". A sketch of that prioritization follows the list.
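As a rough illustration of the prioritization idea, here is a minimal sketch in Python. The fields (`page_views`, `flag_count`, `last_verified`) and the weights are assumptions invented for the example, not anything from the answer above; the point is only that the most-seen and most-flagged items get human attention first.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CatalogItem:
    sku: str
    page_views: int      # visits in the last 30 days (assumed metric)
    flag_count: int      # "inappropriate"/error flags left by visitors
    last_verified: date  # when a data clerk last checked the record

def review_priority(item: CatalogItem, today: date) -> float:
    """Higher score = review sooner. Weights are illustrative only."""
    staleness_days = (today - item.last_verified).days
    return item.page_views * 1.0 + item.flag_count * 50.0 + staleness_days * 0.5

def build_review_queue(items: list[CatalogItem], today: date) -> list[CatalogItem]:
    # The most valuable records to fix float to the top; the long tail waits.
    return sorted(items, key=lambda i: review_priority(i, today), reverse=True)

if __name__ == "__main__":
    items = [
        CatalogItem("BRK-1001", page_views=5400, flag_count=2, last_verified=date(2019, 6, 1)),
        CatalogItem("FLT-0007", page_views=12, flag_count=0, last_verified=date(2019, 11, 1)),
    ]
    for item in build_review_queue(items, date(2019, 12, 10)):
        print(item.sku, review_priority(item, date(2019, 12, 10)))
```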

OTHER TIPS

Since this is more about managing the team/code/data than about implementation, and since you mentioned Amazon, I think you'll find this useful: http://highscalability.com/amazon-architecture.

In particular, follow the link to the Werner Vogels interview.

Build it right in the first place. Ensure that you use every integrity checking method available in the database you're using, as appropriate to what you're storing. Better that an upload fail than bad data get silently introduced.

Then, figure out what you're going to do in terms of your own integrity checking. DB integrity checks are a good start, but they are rarely all you need. They will also force you to think, from the beginning, about what type of data you're working with, how you need to store it, and how to recognize and flag or reject bad or questionable data.
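To make the "reject or flag, never silently accept" idea concrete, here is a minimal sketch. The record layout and the rules (a part-number pattern, a price sanity range, a known-vehicle lookup) are assumptions for illustration; the real checks would come from your own catalog's schema and business rules.

```python
import re
from dataclasses import dataclass, field

PART_NUMBER_RE = re.compile(r"^[A-Z]{2,4}-\d{3,6}$")     # assumed in-house format
KNOWN_VEHICLES = {"2012 Honda Civic", "2015 Ford F-150"}  # stand-in for a real lookup table

@dataclass
class ValidationResult:
    errors: list[str] = field(default_factory=list)    # hard failures: reject the row
    warnings: list[str] = field(default_factory=list)  # questionable: load but flag for review

def validate_part_row(row: dict) -> ValidationResult:
    result = ValidationResult()
    if not PART_NUMBER_RE.match(row.get("part_number", "")):
        result.errors.append(f"bad part number: {row.get('part_number')!r}")
    price = row.get("price")
    if price is None or price <= 0:
        result.errors.append("price missing or non-positive")
    elif price > 10_000:
        result.warnings.append(f"suspicious price {price}, flag for human review")
    if row.get("vehicle") not in KNOWN_VEHICLES:
        result.errors.append(f"unknown vehicle {row.get('vehicle')!r}")
    return result

# Usage: run every inbound row through validate_part_row() before it reaches the
# database, reject rows with errors, and queue rows with warnings for a data clerk.
```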

I can't tell you the amount of pain I've seen from trying to rework (or just work with, day to day) old systems full of garbage data. Doing it right and testing it thoroughly up front may seem like a pain, and it can be, but the reward is a system that mostly hums along and needs little to no intervention.

As for a link, if there's anyone who's had to think about and design for scalability, it's Google. You might find this instructive; it has some good things to keep in mind: http://highscalability.com/google-architecture

Master Data Management is an alternative to what has been proposed. Here is Microsoft's article "The What, Why, and How of Master Data Management". Data stewards are given the rights and the responsibility to maintain the accuracy of the data for the enterprise.

The ability to scale comes mainly from aligning the technology with the business so that the data team is not the only group that can manage the information. Tools and well-defined processes let business owners help maintain enterprise data.

Share data with your suppliers. Then the data is entered once.

If a piece of data is important, it should be entered once; if it isn't, it shouldn't be entered at all.

I would invest heavily in data mining. Get as many feeds as possible about the products you are trying to sell. Get vehicle data directly from vendors, as well as from automotive repair companies like Mitchell and Haynes.

Once you know the parts that you need, cross-correlate those part numbers with part numbers available on the internet. Also cross-correlate them with images, reviews, and articles. Try to aggregate as much information as possible on one page, and eventually allow that page to be indexed by Google. A sketch of that matching step is shown below.
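One way to picture the cross-correlation step is to normalize part numbers and merge records from several feeds under one key. A minimal sketch, where the feed names and field layout are invented for the example:

```python
import re
from collections import defaultdict

def normalize_part_number(raw: str) -> str:
    # Strip punctuation/whitespace and uppercase so "ph-3593a" and "PH 3593A" collide.
    return re.sub(r"[^A-Z0-9]", "", raw.upper())

def aggregate_feeds(feeds: dict[str, list[dict]]) -> dict[str, dict]:
    """Merge records from multiple feeds into one record per normalized part number."""
    merged: dict[str, dict] = defaultdict(lambda: {"sources": [], "images": [], "reviews": []})
    for source, rows in feeds.items():
        for row in rows:
            record = merged[normalize_part_number(row["part_number"])]
            record["sources"].append(source)
            record["images"].extend(row.get("images", []))
            record["reviews"].extend(row.get("reviews", []))
    return dict(merged)

feeds = {
    "vendor_a": [{"part_number": "PH-3593A", "images": ["a.jpg"]}],
    "repair_db": [{"part_number": "ph3593a", "reviews": ["fits 2012 Civic"]}],
}
print(aggregate_feeds(feeds)["PH3593A"]["sources"])   # ['vendor_a', 'repair_db']
```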

Based on the results of your data aggregation, assign a set of weights to each product. Depending on the resulting score, either pass the product to an employee to negotiate price with suppliers, create the page as-is and link to the sources (assuming you would receive a commission), or don't sell the part at all. A sketch of that routing follows this paragraph.
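A toy version of that weighted decision might look like the following; the weights, thresholds, and action strings are assumptions to illustrate the routing, not values from the answer:

```python
def score_product(record: dict) -> float:
    # Weight the completeness of the aggregated record; tune against real outcomes.
    return (2.0 * len(record.get("sources", []))
            + 1.0 * bool(record.get("images"))
            + 1.5 * len(record.get("reviews", [])))

def route_product(record: dict) -> str:
    score = score_product(record)
    if score >= 6.0:
        return "send to employee to negotiate price with suppliers"
    if score >= 3.0:
        return "publish page as-is and link to sources (commission model)"
    return "do not sell this part"
```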

Once you have enough products in one place, you can then support other people who would like to add products to your website. The breadth of Amazon's catalog is in large part due to supporting third-party sellers and letting them list on Amazon's website.

Especially in the auto industry, I think there is great value in high-quality indexing that is both findable by Google and logically navigable by people looking to replace a specific component. You may also want to look into selling or providing location-specific services through IP geolocation, based on the component the visitor is interested in purchasing.

Much of the data managed by sites like Google comes from users: I enter my data and am responsible for its accuracy. The sites' own data is captured from the web, and search data is captured from searches, so there is little need for Google staff to do anything with it. This is likely significantly different from what you are attempting.

Working with manufacturers' feeds could make your efforts less manpower-intensive. The trade-off is investing in data-transformation software. You may want to capture the source of each cross-reference; this will ease reloads when you get updates (see the sketch below).
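A minimal sketch of keeping that source attribution, so a single feed can be reloaded without touching the others. The feed names and tuple layout are invented for the example:

```python
from collections import defaultdict

# Cross-references keyed by the feed they came from, so one feed can be reloaded
# in place without touching the others.
cross_refs: dict[str, set[tuple[str, str]]] = defaultdict(set)

def load_feed(source: str, rows: list[tuple[str, str]]) -> None:
    """Replace every cross-reference previously loaded from `source`."""
    cross_refs[source] = {(a.strip().upper(), b.strip().upper()) for a, b in rows}

def substitutes(part: str) -> set[str]:
    """Every part any source says can stand in for `part`."""
    part = part.strip().upper()
    return {b for pairs in cross_refs.values() for a, b in pairs if a == part}

load_feed("mitchell", [("PH-3593A", "WIX-51348")])
load_feed("mitchell", [("PH-3593A", "WIX-51348"), ("PH-3593A", "FRAM-XG3593A")])
print(substitutes("ph-3593a"))  # the reload replaced the old Mitchell rows entirely
```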

From my experience you also have the issue that cross-references may be unidirectional: A can replace B, but B cannot replace A. A small sketch of how to model that follows.
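One way to keep that distinction is to store replacements as directed edges rather than symmetric pairs. A minimal sketch (the part numbers are made up):

```python
# Directed supersession map: the key can be replaced BY every part in its value set.
# Do not mirror the edge automatically; the reverse is a separate business fact.
replaced_by: dict[str, set[str]] = {
    "OEM-123": {"OEM-123A"},   # OEM-123A supersedes OEM-123 ...
    # ... but there is deliberately no entry saying OEM-123 can replace OEM-123A.
}

def substitutes_for(part: str) -> set[str]:
    """Parts that may be sold in place of `part` (follows the direction of the edge)."""
    return replaced_by.get(part, set())

assert "OEM-123A" in substitutes_for("OEM-123")
assert "OEM-123" not in substitutes_for("OEM-123A")
```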

As long as you have manual entry, you will have errors, and anything you can do in your interface to detect those errors is likely worth the effort; with purely manual entry, input volume can only scale linearly with staff. The sketch below shows the kind of entry-time check that can catch typos before they land in the catalog.
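For example, the entry form could fuzzy-match what the clerk typed against the known vehicle list and ask for confirmation on near misses. The vehicle names here are placeholders for your real lookup table:

```python
import difflib

# Known vehicle names (stand-in for the real lookup table in your catalog).
KNOWN_VEHICLES = ["2012 Honda Civic", "2013 Honda Civic", "2015 Ford F-150"]

def check_vehicle_entry(raw: str) -> tuple[bool, list[str]]:
    """Return (is_exact_match, close_suggestions) so the UI can warn the clerk
    about a probable typo instead of silently accepting it."""
    if raw in KNOWN_VEHICLES:
        return True, []
    suggestions = difflib.get_close_matches(raw, KNOWN_VEHICLES, n=3, cutoff=0.8)
    return False, suggestions

ok, hints = check_vehicle_entry("2012 Honda Civc")
print(ok, hints)   # False, with '2012 Honda Civic' suggested -- prompt the clerk to confirm
```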

Review the research on attention cycles to see whether you can do something to improve the quality of your input and verification processes. Recent research in security screening suggests you may want to seed periodic, deliberate errors into the verification data to keep reviewers alert; a sketch of that idea follows.
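A minimal sketch of the seeding idea, assuming records carry an `id` field and that a 5% seeding rate is acceptable (both are assumptions for illustration):

```python
import random

def seed_verification_batch(records: list[dict], known_bad: list[dict],
                            rate: float = 0.05) -> tuple[list[dict], set[int]]:
    """Insert a few deliberately wrong records so reviewer vigilance can be measured."""
    batch = list(records)
    n_seed = max(1, int(len(records) * rate))
    planted = random.sample(known_bad, min(n_seed, len(known_bad)))
    for bad in planted:
        batch.insert(random.randrange(len(batch) + 1), bad)
    return batch, {bad["id"] for bad in planted}

def catch_rate(flagged_ids: set[int], planted_ids: set[int]) -> float:
    """Fraction of planted errors the reviewer actually flagged."""
    return len(flagged_ids & planted_ids) / len(planted_ids) if planted_ids else 1.0
```

If the catch rate drifts down over a shift, that is a signal to rotate the reviewer or shorten the verification session.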

As others have noted, making it easier for users to flag errors is a good idea.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow