Bias or not when finding patterns using data mining techniques?

https://cs.stackexchange.com/questions/90413

05-11-2019

Question

I am currently following a course on data mining and I am very curious about the deeper underlying method. As far as I have learned so far, data mining is about finding previously unknown patterns that are useful and provide new knowledge about your data.

In data mining, is it okay to start from expectations (bias) about which patterns could be present, and then do statistics to see whether they actually hold? Take the typical example of survivor data from the Titanic. How would I start my analysis, that is, what types of questions would I ask myself to begin with? If I wanted to test whether the survival percentage was lower for male passengers, I could do some statistical analysis and find out whether or not that is the case. I could also program a decision tree and see what my data tells me. That would tell me HOW to use machine learning to analyse the data in order to predict what chance of survival a new passenger x with specific 'properties' would have. Where would the data mining perspective come into this process?
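To make the hypothesis-first route concrete, here is a minimal sketch of the kind of check I have in mind: a Pearson chi-square test on a 2x2 table of sex versus survival. The function name `chi_square_2x2` and the counts are my own illustration; the counts are made up and are NOT the real Titanic figures.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) for the 2x2 contingency table [[a, b], [c, d]].

    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts for illustration only (not the real Titanic data).
male_survived, male_died = 20, 80
female_survived, female_died = 60, 40

stat, p = chi_square_2x2(male_survived, male_died, female_survived, female_died)
print(f"chi2 = {stat:.2f}, p = {p:.3g}")
```

A small p-value here would only say that sex and survival are associated in the table. It says nothing about why, and it tests a pattern I chose in advance rather than one the data suggested.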

I am aware of different types of classifiers and how we can use them to check for patterns in order to make predictions, but how does one go from A (wanting to find patterns) to B (actually finding unknown patterns) in data mining specifically?

Quoting directly from the Wikipedia page on data mining: "Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but do belong to the overall KDD process as additional steps." and "Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD."

This indicates that the actual data mining part is the analysis using mathematics, statistics and machine learning tools. From here, how would I start finding patterns in a large dataset? Would I start from what my prior tells me could be interesting and dig deeper to see whether there actually is an interesting pattern there? If there is not, I can conclude so (which in itself is insightful) and then try finding other patterns. OR do I write an algorithm that searches for correlations and patterns across random combinations of attributes in my data, without me choosing ANY direction to look in? One thing I have been taught is that when you work with REALLY large datasets, all sorts of patterns begin to emerge, and no matter where you look, you will find some sort of pattern. The art is to find USEFUL patterns, ones that actually provide some kind of insight into your data! As always, correlation does not necessarily imply causation, but that belongs to the interpretation step, and I guess data mining is all about just finding the patterns. So how do we actually go about doing this?
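The worry that undirected searching always turns up some pattern can be illustrated with a small simulation. This is only a sketch under my own assumptions: the outcome and every attribute are independent coin flips, so no real pattern exists, and the names `scan_noise` and `chi_square_p` are hypothetical. It counts how many attributes look "significant" at the 0.05 level, first naively and then with a Bonferroni correction for the number of tests.

```python
import math
import random


def chi_square_p(a, b, c, d):
    """p-value of a Pearson chi-square test (1 d.o.f.) on [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(stat / 2))


def scan_noise(n_rows=500, n_attributes=200, alpha=0.05, seed=0):
    """Test many pure-noise binary attributes against a pure-noise outcome."""
    rng = random.Random(seed)
    outcome = [rng.random() < 0.5 for _ in range(n_rows)]
    p_values = []
    for _ in range(n_attributes):
        attr = [rng.random() < 0.5 for _ in range(n_rows)]
        a = sum(1 for x, y in zip(attr, outcome) if x and y)
        b = sum(1 for x, y in zip(attr, outcome) if x and not y)
        c = sum(1 for x, y in zip(attr, outcome) if not x and y)
        d = n_rows - a - b - c
        p_values.append(chi_square_p(a, b, c, d))
    # Naive threshold: each test judged on its own.
    raw_hits = sum(p < alpha for p in p_values)
    # Bonferroni: divide the threshold by the number of tests performed.
    bonferroni_hits = sum(p < alpha / n_attributes for p in p_values)
    return p_values, raw_hits, bonferroni_hits


p_values, raw_hits, bonferroni_hits = scan_noise()
print(f"{len(p_values)} noise attributes tested")
print(f"'significant' at 0.05 without correction: {raw_hits}")
print(f"'significant' after Bonferroni correction: {bonferroni_hits}")
```

With a 0.05 threshold, roughly 5% of unrelated attributes are expected to pass by chance alone, which is why an undirected search needs some correction for the number of patterns it has tried. Bonferroni is just one simple choice, used here only for illustration.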

I hope my question is understandable; I find it hard to formulate it any better. If I were to boil it all down to one sentence, it would be: if I have a large dataset which I have cleaned and prepared, then from a data mining perspective, what thought process should I follow next to find patterns in the data?

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange