Question

Let's say I have 2 TB of data. What is the best sample size to pick? I understand that there is a limit on how much RAM/processing power I have, and hence I should make my sampling decision around that. But let's say processing power is not a concern for me right now. What would be a good way to approach my sample size?


Solution

This is a tough question to answer without more information. I'm going to assume this is for model building, but without more detail it's hard to recommend a specific number.

However, there are some things which should generally be known:

Population size

How large is the population? Does your 2 TB of data comprise the total population, or is it a sample from a given timeframe? What frame of data are you looking at: is this two days' worth of data that is only representative of a given subset of the population, or is it everything? You'll need to know this to know what conclusions you can draw from this dataset.
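If you do know the population size, the standard finite population correction is a quick way to see that a very large population barely changes the sample size you need, which is why the raw 2 TB figure by itself tells you little. A minimal sketch, with made-up numbers:

```python
# Sketch: finite population correction (FPC), assuming a simple random sample.
# n0 is the sample size needed for an "infinite" population; population_size is
# the known population N. Both numbers below are made up for illustration.

def fpc_adjusted_sample_size(n0: float, population_size: int) -> int:
    """Shrink an infinite-population sample size n0 for a finite population N."""
    return round(n0 / (1 + (n0 - 1) / population_size))

# A required n0 of 10,000 barely changes against a population of 100 million.
print(fpc_adjusted_sample_size(10_000, 100_000_000))  # ~9999
```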

Variance

What's the variance of the sample? If it's categorical data, how many unique values are there? Having a metric around this will help determine the number of samples you'll need. If this is a low-variance set, you may only need a few hundred or a few thousand observations.
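If you can estimate that variance, a classic back-of-the-envelope calculation for estimating a mean to within a margin of error e is n = (z * sigma / e)^2. A minimal sketch; the sigma, margin, and confidence values are placeholders, not anything measured from your data:

```python
# Sketch: sample size needed to estimate a mean within a margin of error,
# assuming roughly normal sampling error. sigma, margin_of_error, and
# confidence are assumptions you would plug in from your own data.
from math import ceil
from scipy.stats import norm

def sample_size_for_mean(sigma: float, margin_of_error: float, confidence: float = 0.95) -> int:
    """Classic n = (z * sigma / e)^2 calculation."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    return ceil((z * sigma / margin_of_error) ** 2)

# High variance (sigma=50) with a tight margin (+/-1) needs ~9,604 observations;
# low variance (sigma=5) with the same margin needs only ~97.
print(sample_size_for_mean(50, 1.0))
print(sample_size_for_mean(5, 1.0))
```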

Stratification/grouping

Is your data grouped in a meaningful way? If so, you'll need to factor this into your sample. Depending on what you're doing, you'll want a meaningful representation of the population. If the data is not grouped, but has distinct groups within it that you care about, you may need to stratify or pre-process your data.
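If the data is grouped, one simple way to keep every group represented is proportional stratified sampling, i.e. taking the same fraction from each group. A sketch using pandas; the column name and sampling fraction are hypothetical:

```python
# Sketch: proportional stratified sampling with pandas. The column name
# "customer_segment" and the 1% fraction are placeholders.
import pandas as pd

def stratified_sample(df: pd.DataFrame, group_col: str, frac: float, seed: int = 42) -> pd.DataFrame:
    """Take the same fraction from every group (proportional allocation)."""
    return (
        df.groupby(group_col, group_keys=False)
          .sample(frac=frac, random_state=seed)
    )

# Usage: sample = stratified_sample(df, "customer_segment", frac=0.01)
```

If some groups are tiny but important, you could instead pass a fixed n per group to oversample them, then reweight later.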

Model and goals

All of this ends up coming down to what you're trying to do. If you're trying to classify or parse a set of unique entities, you may be better off streaming a large set of your data rather than trying to sample it. If you're trying to classify images or customers based on behavior, you may only need a small subset depending on how these groups differ.
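If you do end up drawing a sample directly from the full 2 TB rather than loading it, one memory-friendly option is single-pass reservoir sampling, which keeps a uniform random sample of fixed size without knowing the total record count up front. A minimal sketch, assuming line-delimited records; the file name is a placeholder:

```python
# Sketch: single-pass reservoir sampling (Algorithm R) over an iterable of
# unknown length, keeping a uniform random sample of k items in memory.
import random

def reservoir_sample(records, k: int, seed: int = 42):
    """Keep a uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # item i is kept with probability k / (i + 1)
            if j < k:
                reservoir[j] = record
    return reservoir

# Usage (hypothetical file):
# with open("events.log") as f:
#     sample = reservoir_sample(f, k=100_000)
```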

Licensed under: CC-BY-SA with attribution