Question

A lot of people use the term big data in a rather commercial way, as a means of indicating that large datasets are involved in the computation, and therefore potential solutions must have good performance. Of course, big data always carries along associated terms, like scalability and efficiency, but what exactly defines a problem as a big data problem?

Does the computation have to be related to some set of specific purposes, like data mining / information retrieval, or could an algorithm for general graph problems be labeled big data if the dataset were big enough? Also, how big is big enough (if that can even be defined)?

Solution

To me (coming from a relational database background), "Big Data" is not primarily about the data size (which is the bulk of what the other answers are about so far).

"Big Data" e "Bad dati" sono strettamente correlati. Database relazionali require 'dati incontaminate. Se i dati sono nel database, è preciso, pulito e affidabile al 100%. Database relazionali richiedono "Great dati" e una quantità enorme di tempo, denaro, e la responsabilità è messo a fare in modo che i dati sono ben preparato prima di caricarlo al database. Se i dati sono nel database, è 'gospel', e definisce il sistema di comprensione della realtà.

"Big Data" affronta questo problema dalla direzione opposta. I dati sono poco definita, gran parte di essa può essere imprecise, e gran parte di essa può infatti essere mancante. La struttura e la disposizione dei dati è lineare anziché relazionale.

Big Data has to have enough volume that the amount of bad or missing data becomes statistically insignificant. When the errors in your data are common enough to cancel each other out, when the missing data is proportionally small enough to be negligible, and when your data access requirements and algorithms are functional even with incomplete and inaccurate data, then you have "Big Data".
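As a rough, hypothetical illustration of the "errors cancel each other out at scale" point, here is a minimal sketch (not from the answer itself; the dataset, noise level, and missing-data rate are all made up) showing that an aggregate statistic barely moves when every record is noisy and some records are dropped:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1_000_000                              # lots of records
true_values = rng.normal(100.0, 15.0, n)   # the "real" measurements

# Corrupt the data: add zero-mean noise to every record and drop ~1% of
# the records entirely (simulating missing data).
noisy = true_values + rng.normal(0.0, 30.0, n)
kept = noisy[rng.random(n) > 0.01]

print("true mean :", true_values.mean())
print("noisy mean:", kept.mean())          # very close, despite noise and gaps
```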

"Big Data" non è in realtà circa il volume, è circa le caratteristiche dei dati.

Other tips

As you rightly note, these days "big data" is something everyone wants to say they have, which entails a certain looseness in how people define the term. Generally, though, I'd say you're certainly dealing with big data if the scale is such that it's no longer feasible to manage it with more traditional technologies such as an RDBMS, at least without complementing them with big data technologies such as Hadoop.

How big your data actually has to be for that to be the case is debatable. Here's a (somewhat provocative) blog post claiming that it isn't really the case for less than 5 TB of data. (To be clear, it doesn't claim "less than 5 TB isn't big data", but just "less than 5 TB isn't big enough that you need Hadoop".)

But even on smaller datasets, big data technologies like Hadoop can have other advantages, including being well suited to batch operations, playing well with unstructured data (as well as data whose structure isn't known in advance or might change), horizontal scalability (scaling by adding more nodes instead of beefing up your existing servers), and (as one of the commenters on the above-linked post notes) the ability to integrate your data processing with external datasets (think of a map-reduce where the mapper makes a call to another server). Other technologies associated with big data, like NoSQL databases, emphasize fast performance and constant availability while dealing with large sets of data, as well as being able to handle semi-structured data and to scale horizontally.
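For readers unfamiliar with the map-reduce model mentioned above, here is a minimal in-process sketch of its shape (plain Python, not actual Hadoop; the word-count example and function names are just for illustration):

```python
from collections import defaultdict

def mapper(record):
    # Emit (key, value) pairs; here, one count per word.
    for word in record.split():
        yield word, 1

def reducer(key, values):
    return key, sum(values)

def map_reduce(records):
    groups = defaultdict(list)            # the "shuffle" phase
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

print(map_reduce(["big data is big", "data about data"]))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

In a real cluster, the mapper and reducer run on many nodes in parallel, which is where the restriction on the form of the computation comes from.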

Of course, traditional RDBMSs have their own advantages, including ACID guarantees (Atomicity, Consistency, Isolation, Durability) and better performance for certain operations, as well as being more standardized, more mature, and (for many users) more familiar. So even for indisputably "big" data, it may make sense to load at least a portion of your data into a traditional SQL database and use that in conjunction with big data technologies.

So, a more generous definition would be that you have big data as long as it's big enough that big data technologies provide some added value for you. But as you can see, that can depend not just on the size of your data but on how you want to work with it and what sort of requirements you have in terms of flexibility, consistency, and performance. How you're using your data is more relevant to the question than what you're using it for (e.g., data mining). That said, uses like data mining and machine learning are more likely to yield useful results if you have a big enough dataset to work with.

Total amount of data in the world: 2.8 zettabytes in 2012, estimated to reach 8 zettabytes by 2015 (source), with a doubling time of 40 months. Can't get bigger than that :)

As an example of a single large organization, Facebook pulls in 500 terabytes per day into a 100 petabyte warehouse, and runs 70k queries per day on it as of 2012 (source). Their current warehouse is >300 petabytes.

Big data is probably something that is a good fraction of the Facebook numbers (1/100 probably yes, 1/10000 probably not: it's a spectrum not a single number).

In addition to size, some of the features that make it "big" are:

  • it is actively analyzed, not just stored (quote "If you aren’t taking advantage of big data, then you don’t have big data, you have just a pile of data" Jay Parikh @ Facebook)

  • building and running a data warehouse is a major infrastructure project

  • it is growing at a significant rate

  • it is unstructured or has irregular structure

Gartner definition: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing" (the 3 Vs). So they also think "bigness" isn't entirely about the size of the dataset, but also about the velocity and structure and the kind of tools needed.

To me Big Data is primarily about the tools (after all, that's where it started); a "big" dataset is one that's too big to be handled with conventional tools - in particular, big enough to demand storage and processing on a cluster rather than a single machine. This rules out a conventional RDBMS, and demands new techniques for processing; in particular, various Hadoop-like frameworks make it easy to distribute a computation over a cluster, at the cost of restricting the form of this computation. I'll second the reference to http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html ; Big Data techniques are a last resort for datasets which are simply too big to handle any other way. I'd say any dataset for any purpose could qualify if it was big enough - though if the shape of the problem is such that existing "big data" tools aren't appropriate, then it would probably be better to come up with a new name.

Of course there is some overlap; when I (briefly) worked at last.fm, we worked on the same 50TB dataset using Hadoop and also in an SQL database on a fairly ridiculous server (I remember it had 1TB RAM, and this is a few years ago). Which in a sense meant it both was and wasn't big data, depending on which job you were working on. But I think that's an accurate characterization; the people who worked on the Hadoop jobs found it useful to go to Big Data conferences and websites, while the people who worked on the SQL jobs didn't.

Data becomes "big" when a single commodity computer can no longer handle the amount of data you have. It denotes the point at which you need to start thinking about building supercomputers or using clusters to process your data.

Big Data is defined by the volume of data, that's right, but not only by that. The particularity of big data is that you need to store lots of varied and sometimes unstructured stuff, all the time and from tons of sensors, usually for years or decades.

Furthermore, you need something scalable, so that it doesn't take you half a year to find a piece of data again.

So here comes Big Data, where traditional methods won't work anymore. SQL is not scalable. And SQL works with very structured and linked data (with all that primary and foreign key mess, inner joins, nested queries...).

Basically, because storage becomes cheaper and cheaper and data becomes more and more valuable, big managers ask engineers to record everything. Add to this tons of new sensors from all those mobile devices, social networks, embedded things, etc. So, since classic methods won't work, they have to find new technologies (storing everything in files, in JSON format, with big indexes, what we call NoSQL).
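A minimal, hypothetical sketch of the "record everything raw, link it later" pattern described here (JSON lines with no schema enforced at write time; the file name and record fields are made up, and real systems add indexes on top of this):

```python
import json

def append_record(path, record):
    # Append-only log of raw events; nothing is validated or normalized here.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def scan(path, predicate):
    # Structure is imposed at read time, not at write time ("schema on read").
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if predicate(record):
                yield record

append_record("events.jsonl", {"sensor_id": "s1", "temp": 21.5})
append_record("events.jsonl", {"sensor_id": "s2", "payload": "<xml>...</xml>"})

print(list(scan("events.jsonl", lambda r: r.get("sensor_id") == "s1")))
```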

So Big Data may be very big, but it can also be not-so-big yet complex, unstructured, or varied data which has to be stored quickly and on the fly in a raw format. We focus on storing it first, and then we look at how to link everything together.

I'll share what Big Data is like in genomics, in particular de-novo assembly.

When we sequence your genome (e.g., to detect novel genes), we take billions of next-generation short reads. Look at the image below, where we try to assemble some reads.

[Image: example of overlapping short reads being assembled]

This looks simple, right? But what if you have billions of those reads? What if those reads contain sequencing errors? What if your machine doesn't have enough RAM to keep the reads? What about repetitive DNA regions, such as the very common Alu element?

De-novo assembly is done by constructing a De Bruijn graph:

[Image: De Bruijn graph built from overlapping reads]

The graph is a cleverly designed data structure for representing overlapping reads. It's not perfect, but it's better than generating all possible overlaps and storing them in an array.

The assembly process could take days to complete, because there are quite a number of paths that an assembler would need to traverse and collapse.
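To make the idea concrete, here is a minimal sketch of how such a graph is built from reads (the toy reads and the value of k are arbitrary; real assemblers add error correction, coverage counts, and graph simplification on top of this):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    # Every k-mer in a read becomes an edge from its (k-1)-mer prefix
    # to its (k-1)-mer suffix; overlapping reads share nodes.
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

reads = ["ACGTAC", "CGTACG", "GTACGT"]
for node, neighbours in de_bruijn(reads, k=4).items():
    print(node, "->", neighbours)
```

Assembly then amounts to walking paths through this graph, which is where the traversal and collapsing work mentioned above comes from.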

In genomics, you have big data when:

  • You can't brute force all combinations
  • Your computer doesn't have enough physical memory to store the data
  • You need to reduce the dimensions (eg: collapsing redundant graph paths)
  • You get pissed off because you'd have to wait days to do anything
  • You need a special data structure to represent the data
  • You need to filter your data-set for errors (eg: sequencing errors)

https://en.wikipedia.org/wiki/De_Bruijn_graph

There is something special about graph algorithms, which your original question mentions, and it is essentially about the ability to partition the data.

For some things, like sorting numbers in an array, it is not too difficult to partition the problem on the data structure into smaller disjoint pieces, e.g. here: parallel in-place merge sort.

For graph algorithms, however, there is the challenge that finding an optimal partitioning for a given graph metric is known to be $NP$-hard.

So while sorting 10 GB of numbers might be a very approachable problem on a normal PC (you can just go at it via dynamic programming and have very good predictability about the program flow), working with a 10 GB graph data structure can already be challenging.
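As an illustration of how easily the sorting case partitions, here is a minimal sketch of an external merge sort (the chunk size, file handling, and toy input are just for illustration): sorted chunks are spilled to disk and then merged, so no single piece has to fit the whole dataset in memory.

```python
import heapq
import tempfile

def spill(sorted_chunk):
    # Write one sorted run to a temp file and return its path.
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    f.writelines(f"{x}\n" for x in sorted_chunk)
    f.close()
    return f.name

def external_sort(numbers, chunk_size):
    runs, chunk = [], []
    for x in numbers:                      # `numbers` can be any iterator
        chunk.append(x)
        if len(chunk) == chunk_size:
            runs.append(spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(spill(sorted(chunk)))
    files = [open(r) for r in runs]
    try:
        # k-way merge of the already-sorted runs.
        yield from heapq.merge(*(map(float, f) for f in files))
    finally:
        for f in files:
            f.close()

print(list(external_sort(iter([5, 3, 9, 1, 7, 2]), chunk_size=2)))
# -> [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
```

There is no comparably clean way to split an arbitrary graph into independent pieces, which is the point of the contrast above.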

There are a number of specialized frameworks, such as GraphX, that use special methods and computing paradigms to somewhat circumvent the inherent challenges of graphs.

So, to answer your question briefly: as mentioned before by others, when your data does not fit into main memory on a normal PC but you need all of it to answer your problem, that is a good hint that your data is already somewhat big. The exact labeling, though, depends I think a bit on the data structure and the question asked.

I think that big data starts at the point where the size prevents you from doing what you want to. In most scenarios, there is a limit on the running time that is considered feasible. In some cases it is an hour, in some cases it might be a few weeks. As long as the data is not so big that only O(n) algorithms can run within the feasible time frame, you haven't reached big data.

I like this definition since it is agnostic to volume, technology level and specific algorithms. It is not agnostic to resources so a grad student will reach the point of big data way before Google.

In order to be able to quantify how big the data is, I like to consider the time needed to back it up. Since technology advances, volumes that were considered big some years ago are now moderate. Backup time improves as the technology improves, just like the running time of the learning algorithms. I feel it is more sensible to talk about a dataset that takes X hours to back up rather than a dataset of Y bytes.

PS.

It is important to note that even if you have reached the big data point and you cannot run algorithms of complexity higher than O(n) in the straightforward way, there is plenty you can do in order to still benefit from such algorithms.

For example, feature selection can reduce the number of features on which many algorithms' running time depends. In many long-tail distributions, focusing on the few items in the head might be of benefit. You can take a sample and run the slower algorithms on it.
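A minimal, made-up sketch of that last point (run the expensive computation on a random sample instead of the full dataset; the data, sample size, and statistic are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2_000_000, 3))   # full dataset: too big for O(n^2) work

# Exact mean pairwise distance is O(n^2) in the number of points;
# estimate it from a random sample instead (rough estimate: the zero
# self-distances on the diagonal are included).
sample = data[rng.choice(len(data), size=1_000, replace=False)]
diffs = sample[:, None, :] - sample[None, :, :]
estimate = np.sqrt((diffs ** 2).sum(axis=-1)).mean()
print(f"estimated mean pairwise distance: {estimate:.3f}")
```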

Data is "Big Data" if it is of such volume that it is less expensive to analyze it on two or more commodity computers, than on one high-end computer.

This is essentially how Google's "BigFiles" file system originated. Page and Brin could not afford a fancy Sun server to store and search their web index, so they hooked up several commodity computers.

I tend to agree with what @Dan Levin has already said. Ultimately since we want to draw useful insights from the data rather than just storing it, it's the ability of learning algorithms/systems which should determine what is called "Big data". As ML systems evolve what was Big data today will no longer be Big Data tomorrow.

One way of defining Big data could be:

  • Big data: Data on which you can't build ML models in a reasonable time (1-2 hours) on a typical workstation (with, say, 4 GB of RAM)
  • Non-Big data: complement of the above

Assuming this definition, as long as the memory occupied by an individual row (all variables for a single data point) does not exceed the machine's RAM, we should be in the non-big-data regime.

Note: Vowpal Wabbit (by far the fastest ML system as of today) can learn on any dataset as long as an individual row (data point) is < RAM (say 4 GB). The number of rows is not a limitation because it uses SGD on multiple cores. Speaking from experience, you can train a model with 10k features and 10 million rows on a laptop in a day.
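Vowpal Wabbit itself is a command-line tool; as a rough analogue of the same out-of-core pattern in Python, here is a minimal sketch using scikit-learn's SGDClassifier with partial_fit (the synthetic mini-batches, their sizes, and the label rule are all made up):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Stream the data in mini-batches and update a linear model with SGD,
# so only one batch (not the whole dataset) ever has to fit in RAM.
rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])

def batches(n_batches=100, batch_size=10_000, n_features=20):
    # Stand-in for reading successive chunks from disk.
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + 0.1 * rng.normal(size=batch_size) > 0).astype(int)
        yield X, y

for X, y in batches():
    model.partial_fit(X, y, classes=classes)   # classes required on first call

X_test, y_test = next(batches(n_batches=1))
print("held-out accuracy:", model.score(X_test, y_test))
```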

"Big data" is literally just a lot of data. While it's more of a marketing term than anything, the implication is usually that you have so much data that you can't analyze all of the data at once because the amount of memory (RAM) it would take to hold the data in memory to process and analyze it is greater than the amount of available memory.

This means that analyses usually have to be done on random segments of data, which allows models to be built to compare against other parts of the data.

Licensed under: CC-BY-SA with attribution