What are the best practices for collecting, maintaining, and ensuring the accuracy of a huge data set?

StackOverflow https://stackoverflow.com/questions/4505502

  •  12-10-2019

Question

I'm asking this question looking for practical advice on how to design a system.

Sites like amazon.com and Pandora have and maintain huge data sets to run their core business. For example, Amazon (and every other major e-commerce site) has millions of products for sale, images of those products, prices, specifications, etc., etc., etc.

Ignoring the data coming from 3rd-party sellers and user-generated content, all that "stuff" had to come from somewhere and is maintained by someone. And it's incredibly detailed and accurate. How? How do they do it? Is there just an army of data-entry clerks, or have they built systems to handle the grunt work?

My company is in a similar situation. We maintain a huge catalog (tens of millions of records) of replacement parts and the cars they fit. We've been at it for a while and have come up with a set of programs and processes to keep our catalog growing and accurate; however, it seems that to grow the catalog by x items we need to grow the team by y.

I need to figure out some ways to make the data team more efficient, and I hope I can learn from the work of others. Any suggestions are appreciated, though what I'd like most are links to content I could spend some serious time reading.

Was it helpful?

Solution

Use your visitors.

  1. Even if you have one person per item, there will be wrong records, and customers will find them. So let them mark items as "inappropriate" and leave a short comment. But don't forget that they are not your employees, so don't ask too much of them; look at Facebook's "Like": it's easy to use and demands very little effort from the user. Good performance/price. If Facebook had a required field asking "Why do you like this?", nobody would use the feature. (A sketch of this follows the list.)

  2. Visitors also help you implicitly: they visit item pages and they use search (both your internal search engine and external ones like Google). You can extract information from visitor activity; for example, rank items by visits, then concentrate your human effort at the top of the list and spend less on the "long tail". (Also sketched below.)
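
A minimal sketch of point 1 (Python with SQLite; the table and function names are my own illustration, not from the answer): one cheap "this looks wrong" action per user per item, with a strictly optional comment, feeding a review queue:

    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("catalog.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS item_flags (
            item_id    INTEGER NOT NULL,
            user_id    INTEGER NOT NULL,
            comment    TEXT,                   -- optional, never required
            flagged_at TEXT NOT NULL,
            UNIQUE (item_id, user_id)          -- one flag per user per item
        )
    """)

    def flag_item(item_id, user_id, comment=None):
        """Record a low-effort 'something is wrong here' click."""
        conn.execute(
            "INSERT OR IGNORE INTO item_flags VALUES (?, ?, ?, ?)",
            (item_id, user_id, comment, datetime.now(timezone.utc).isoformat()),
        )
        conn.commit()

    def most_flagged(limit=20):
        """Items with the most distinct flaggers top the review queue."""
        return conn.execute(
            "SELECT item_id, COUNT(*) AS flags FROM item_flags "
            "GROUP BY item_id ORDER BY flags DESC LIMIT ?", (limit,)
        ).fetchall()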
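
And a sketch of point 2, assuming you can export per-item visit counts from your server logs or analytics: split the catalog into a heavily reviewed "head" and a spot-checked "long tail":

    from collections import Counter

    def review_priorities(page_views, coverage=0.8):
        """Split items into a 'head' that gets full human review and a
        'tail' that only gets sampled, where the head accounts for
        `coverage` (e.g. 80%) of all recorded visits."""
        counts = Counter(page_views)            # item_id -> visit count
        total = sum(counts.values())
        head, seen = [], 0
        for item_id, views in counts.most_common():
            head.append(item_id)
            seen += views
            if seen >= coverage * total:
                break
        tail = [i for i in counts if i not in set(head)]
        return head, tail

    # page_views is one item_id per recorded visit:
    head, tail = review_priorities(["p1", "p1", "p2", "p1", "p3"])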

Other tips

Since this is about managing a team/code/data rather than implementation, and since you mentioned Amazon, I think you will find this useful: http://highscalability.com/amazon-architecture .

In particular, follow the link to the Werner Vogels interview.

Build it right in the first place. Ensure that you use every integrity checking method available in the database you're using, as appropriate to what you're storing. Better that an upload fail than bad data get silently introduced.
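
As a hedged illustration (SQLite here, but the same idea applies to any RDBMS): NOT NULL, CHECK, UNIQUE, and foreign-key constraints make a bad upload fail loudly instead of slipping in silently:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only if asked

    conn.executescript("""
        CREATE TABLE vehicles (
            vehicle_id INTEGER PRIMARY KEY,
            make  TEXT NOT NULL,
            model TEXT NOT NULL,
            year  INTEGER NOT NULL CHECK (year BETWEEN 1900 AND 2100)
        );
        CREATE TABLE parts (
            part_number TEXT PRIMARY KEY,            -- no duplicate part numbers
            description TEXT NOT NULL,
            price_cents INTEGER NOT NULL CHECK (price_cents > 0),
            vehicle_id  INTEGER NOT NULL REFERENCES vehicles(vehicle_id)
        );
    """)

    # A bad row fails loudly instead of being silently introduced:
    try:
        conn.execute("INSERT INTO parts VALUES ('BP-100', 'Brake pad', -5, 1)")
    except sqlite3.IntegrityError as err:
        print("upload rejected:", err)   # negative price, unknown vehicle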

Then, figure out what you're going to do in terms of your own integrity checking. DB integrity checks are a good start, but rarely are all you need. That will also force you to think, from the beginning, about what type of data you're working with, how you need to store it, and how to recognize and flag or reject bad or questionable data.
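
One possible shape for that second layer (the field names and thresholds are hypothetical): a validate-then-route step that rejects clearly bad records and queues merely suspicious ones for a human:

    def validate_part(record):
        """Return ('accept' | 'review' | 'reject', reasons) for one record."""
        reasons = []
        # Hard failures: reject outright rather than store garbage.
        if not record.get("part_number"):
            reasons.append("missing part number")
        if record.get("price", 0) <= 0:
            reasons.append("non-positive price")
        if reasons:
            return "reject", reasons
        # Soft failures: plausible but suspicious, so queue for a human.
        if record["price"] > 10_000:
            reasons.append("price unusually high")
        if len(record.get("description", "")) < 10:
            reasons.append("description too short")
        return ("review" if reasons else "accept"), reasons

    print(validate_part({"part_number": "BP-100", "price": 25_000,
                         "description": "Brake pad set"}))
    # -> ('review', ['price unusually high'])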

I can't tell you the amount of pain I've seen from trying to rework (or just day to day work with) old systems full of garbage data. Doing it right and testing it thoroughly up front may seem like a pain, and it can be, but the reward is having a system that for the most part hums along and needs little to no intervention.

As for a link, if there's anyone who's had to think about and design for scalability, it's Google. You might find this instructive; it's got some good things to keep in mind: http://highscalability.com/google-architecture

Master Data Management is an alternative to what has been proposed. Here is Microsoft's article "The What, Why, and How of Master Data Management". Data stewards are given the rights and the responsibility to maintain the accuracy of the data for the enterprise.

The main ability to scale comes from aligning the technology with the business so that the data personnel are not the only people who can manage the information. Tools and processes/procedures enable business owners to help manage enterprise data.
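
A rough sketch of what that division of labor can look like (the classes and roles are illustrative, not taken from the Microsoft article): business owners propose changes, and only a designated data steward can approve them into the master record:

    from dataclasses import dataclass

    @dataclass
    class ProposedChange:
        record_id: str
        field_name: str
        new_value: str
        proposed_by: str
        status: str = "pending"

    class StewardQueue:
        """Business owners propose edits; only the domain's steward approves."""
        def __init__(self, stewards):
            self.stewards = set(stewards)        # users with approval rights
            self.pending = []

        def propose(self, change):
            self.pending.append(change)

        def approve(self, change, steward):
            if steward not in self.stewards:
                raise PermissionError(f"{steward} is not a steward for this domain")
            change.status = "approved"           # then apply to the master record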

Share data with your suppliers. Then the data only has to be entered once.

If it is important, it should be entered once; otherwise, not at all.

I would invest heavily in data mining. Get as many feeds as possible about the products you are trying to sell. Get feeds about the vehicles directly from vendors, as well as from automotive repair companies like Mitchell and Haynes.

Once you know the parts that you need, cross-correlate those part numbers with part numbers that are available on the internet. Also cross-correlate those part numbers with images, reviews, and articles. Attempt to aggregate as much information as possible on one page, and eventually allow that page to be indexed by Google.
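
A minimal sketch of that cross-correlation step, assuming the feeds arrive as lists of records keyed by vendor-formatted part numbers:

    from collections import defaultdict

    def normalize(part_number):
        """Vendors format the same part number differently; strip the noise."""
        return part_number.upper().replace("-", "").replace(" ", "")

    def correlate(feeds):
        """feeds: {"vendor_a": [{"part_number": ..., ...}, ...], ...}
        Returns one aggregated record per normalized part number, keeping
        every source's data side by side for later comparison/weighting."""
        merged = defaultdict(dict)
        for source, records in feeds.items():
            for rec in records:
                merged[normalize(rec["part_number"])][source] = rec
        return merged

    feeds = {
        "vendor_a": [{"part_number": "BP-100", "price": 19.99}],
        "repair_db": [{"part_number": "bp 100", "fits": ["2004 Civic"]}],
    }
    print(correlate(feeds)["BP100"])   # both sources, matched on one key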

Based on the results of your data aggregation, assign a series of weights to each product. Based on the value of those weights, either pass the results on to an employee and have them negotiate price with suppliers, create a page as-is and link to the sources (assuming you would receive a commission), or don't sell the part.
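
For example (the scoring and thresholds below are invented for illustration; you would tune them against real margin data):

    def score(item):
        """Toy scoring: more independent sources and richer content score higher."""
        s = 2 * len(item.get("sources", []))    # confirmed by several feeds
        s += 1 if item.get("image_url") else 0
        s += 1 if item.get("reviews") else 0
        return s

    def route(item, sell_at=4, link_at=2):
        """Thresholds are illustrative, not a recommendation."""
        s = score(item)
        if s >= sell_at:
            return "negotiate price with suppliers and stock it"
        if s >= link_at:
            return "publish the page, link to sources for commission"
        return "do not sell the part"

    print(route({"sources": ["vendor_a", "repair_db"], "image_url": "x.jpg"}))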

Once you have enough products in one place, you can then support other people who would like to add additional products to your website. The breadth of resources available on Amazon is in large part due to supporting third-party sellers and allowing those sellers to list on Amazon's website.

Especially in the auto industry, I think there is great value in high-quality indexing that is both Google-findable and logically findable by people looking to replace a specific component. You may also want to look into selling/providing location-specific services through IP geolocation, based on the component they are interested in purchasing.

Much of the data managed by a site like Google comes from users. I enter my data and am responsible for its accuracy. The sites have their data, and it is captured from the web. Search data is captured from searches. This is likely significantly different from what you are attempting. There is little requirement for Google staff to do anything with it.

Working with manufacturers' feeds could make your efforts less manpower-intensive. The trade-off is investing in the data transformation software. You may want to capture the source for each cross-reference; this will ease reloads when you get updates.

From my experience, you also have the issue that cross-references may be unidirectional: A can replace B, but B cannot replace A.
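
One way to capture both points, sketched with SQLite: store each cross-reference as a directed edge tagged with the feed that asserted it, so the reverse direction is never implied and a vendor's update can be reloaded in one statement:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE cross_refs (
            old_part TEXT NOT NULL,       -- the part being replaced
            new_part TEXT NOT NULL,       -- its valid substitute
            source   TEXT NOT NULL,       -- which feed asserted this
            PRIMARY KEY (old_part, new_part, source)
        )
    """)

    # 'A can replace B' is one directed edge; the reverse is NOT implied.
    conn.execute("INSERT INTO cross_refs VALUES ('B', 'A', 'vendor_feed_2023')")

    def substitutes_for(part):
        rows = conn.execute(
            "SELECT new_part FROM cross_refs WHERE old_part = ?", (part,))
        return [r[0] for r in rows]

    print(substitutes_for("B"))   # ['A']
    print(substitutes_for("A"))   # [] -- B does not substitute for A

    # When a vendor sends an updated feed, reload just that source:
    conn.execute("DELETE FROM cross_refs WHERE source = ?", ("vendor_feed_2023",))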

As long as you have manual entry, you will have errors. Anything you can do in your interface to detect these errors is likely worth the effort. With manual entry, input volume scales only linearly with staff.
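
A small example of such an interface check, using fuzzy matching from Python's standard library to catch likely typos and near-duplicate part numbers at entry time (the part numbers are invented):

    import difflib

    KNOWN_PARTS = ["BP-100", "BP-101", "ALT-2200", "RAD-5510"]

    def check_entry(part_number):
        """Warn the operator when a new part number looks like a typo of
        an existing one, instead of silently creating a near-duplicate."""
        if part_number in KNOWN_PARTS:
            return f"{part_number}: exists, linking to existing record"
        close = difflib.get_close_matches(part_number, KNOWN_PARTS,
                                          n=3, cutoff=0.8)
        if close:
            return f"{part_number}: new, but did you mean {close}?"
        return f"{part_number}: accepted as a new part"

    print(check_entry("BP-10O"))   # letter O instead of zero -> flagged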

Review the research on attention cycles to determine whether you can do something to increase the quality of your input and verification processes. Recent research in security screening suggests that you may want to inject periodic deliberate errors into the verification data.
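
A sketch of that idea, assuming you keep a pool of records whose errors are already known: seed a few into each review batch and measure the catch rate:

    import random

    def build_review_batch(real_records, known_bad, seed_rate=0.05):
        """Mix deliberately wrong records (whose errors we already know)
        into the review queue; the catch rate on them estimates how many
        real errors slip through, and keeps reviewer attention from fading."""
        n_seeds = min(len(known_bad), max(1, int(len(real_records) * seed_rate)))
        batch = [(rec, False) for rec in real_records]      # (record, is_seeded)
        batch += [(rec, True) for rec in random.sample(known_bad, n_seeds)]
        random.shuffle(batch)
        return batch

    def seeded_catch_rate(review_results):
        """review_results: list of (is_seeded, reviewer_flagged_it) pairs."""
        seeded = [flagged for is_seeded, flagged in review_results if is_seeded]
        return sum(seeded) / len(seeded) if seeded else None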

As others have noted, making it easier for users to flag errors is a good idea.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow