What is the best Data Mining algorithm for prediction based on a single variable?

https://datascience.stackexchange.com/questions/2273

16-10-2019

Question

I have a variable whose value I would like to predict, and I would like to use only one variable as predictor. For instance, predict traffic density based on weather.

Initially, I thought about using Self-Organizing Maps (SOM), which perform unsupervised clustering + regression. However, since dimensionality reduction is an important component of SOM, I see it as more appropriate for a large number of variables.

Does it make sense to use it with a single predictor variable? Maybe there are more suitable techniques for this simple case. I used "Data Mining" instead of "machine learning" in the title of my question because I think a linear regression might do the job...

Solution

A common rule in machine learning is to try simple things first. For predicting continuous variables there's nothing more basic than simple linear regression. "Simple" in the name means that only one predictor variable is used (+ intercept, of course):

y = b0 + x*b1

where b0 is an intercept and b1 is a slope. For example, you may want to predict lemonade consumption in a park based on temperature:

cons = b0 + temp * b1
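As a minimal sketch of how b0 and b1 could be estimated, the snippet below uses the closed-form least-squares solution for one predictor. The temperature and consumption numbers and the function name `fit_simple_linear` are made up for illustration; in practice a library such as NumPy or scikit-learn would usually be used instead.

```python
# Ordinary least squares for one predictor: cons = b0 + temp * b1.
# All data below are hypothetical and chosen only for illustration.

def fit_simple_linear(x, y):
    """Return (b0, b1) minimizing the squared error of y = b0 + x * b1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # The fitted line always passes through the point of means
    b0 = mean_y - b1 * mean_x
    return b0, b1

temp = [15.0, 20.0, 25.0, 30.0, 35.0]  # degrees Celsius (hypothetical)
cons = [40.0, 50.0, 60.0, 70.0, 80.0]  # litres sold (hypothetical)

b0, b1 = fit_simple_linear(temp, cons)
predicted = b0 + 28.0 * b1  # predicted consumption at 28 degrees
print(b0, b1, predicted)
```

With real, noisy data the fitted line will not pass through every point, so it is worth looking at the residuals before trusting the predictions.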

Temperature is a well-defined continuous variable. But if we talk about something more abstract like "weather", then it's harder to understand how we measure and encode it. It's ok if we say that the weather takes values {terrible, bad, normal, good, excellent} and assign them numbers from -2 to +2 (implying that "excellent" weather is twice as good as "good").

But what if the weather is given by words {shiny, rainy, cool, ...}? We can't give an order to these values. We call such variables categorical. Since there's no natural order between different categories, we can't encode them as a single numerical variable (and linear regression expects numbers only), but we can use so-called dummy encoding: instead of a single variable weather we use 3 variables - [weather_shiny, weather_rainy, weather_cool], only one of which can take value 1, while the others take value 0. In fact, we will have to drop one variable because of collinearity. So a model for predicting traffic from weather may look like this:

traffic = b0 + weather_shiny * b1 + weather_rainy * b2  # weather_cool dropped

where either weather_shiny or weather_rainy is 1, or both are 0 (which means the weather is cool).
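A small sketch of dummy encoding with one category dropped might look like the following. The traffic numbers and the helper names `dummy_encode` and `fit_dummy_model` are hypothetical. The fit relies on the fact that, when the only predictors are dummies for a single categorical variable, least squares reduces to per-category means: b0 is the mean of the dropped (baseline) category and each remaining coefficient is that category's offset from it. The sketch also assumes every category appears at least once in the data.

```python
# Dummy encoding with one category dropped as the baseline.
# Category names follow the text; the traffic numbers are made up.

CATEGORIES = ["shiny", "rainy", "cool"]
BASELINE = "cool"  # dropped to avoid collinearity with the intercept

def dummy_encode(weather):
    """Map a category to [weather_shiny, weather_rainy]."""
    if weather not in CATEGORIES:
        raise ValueError(f"unknown weather category: {weather!r}")
    return [1 if weather == c else 0 for c in CATEGORIES if c != BASELINE]

def fit_dummy_model(weather, traffic):
    """Fit traffic = b0 + weather_shiny * b1 + weather_rainy * b2.

    With only dummy predictors, least squares gives per-category means.
    """
    means = {}
    for c in CATEGORIES:
        values = [t for w, t in zip(weather, traffic) if w == c]
        means[c] = sum(values) / len(values)
    b0 = means[BASELINE]
    b1 = means["shiny"] - b0
    b2 = means["rainy"] - b0
    return b0, b1, b2

weather = ["shiny", "shiny", "rainy", "rainy", "cool", "cool"]
traffic = [100.0, 120.0, 150.0, 170.0, 90.0, 110.0]

b0, b1, b2 = fit_dummy_model(weather, traffic)
weather_shiny, weather_rainy = dummy_encode("rainy")
prediction = b0 + weather_shiny * b1 + weather_rainy * b2
print(prediction)
```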

Note that you can also encounter a non-linear dependency between the predictor and the predicted variable (you can easily check this by plotting (x,y) pairs). The simplest way to deal with it without abandoning the linear model is to use polynomial features - simply add powers of your feature as new features. E.g. for the temperature example (for dummy variables it doesn't make sense, because 1^n and 0^n are still 1 and 0 for any n):

traffic = b0 + temp * b1 + temp^2 * b2 [+ temp^3 * b3 + ...]
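The polynomial model is still linear in the coefficients, so it can be fitted with ordinary least squares. The sketch below builds the features [1, temp, temp^2] and solves the normal equations with plain Gaussian elimination. The data and the function names are hypothetical, and the normal equations become numerically unreliable for high degrees or unscaled inputs, so a library least-squares routine would normally be preferable.

```python
# Polynomial features for one predictor, fitted by least squares.
# Data are made up so that traffic = 5 - 2*temp + temp^2 exactly.

def polynomial_features(x, degree):
    """Return rows [1, x, x^2, ..., x^degree] for each value in x."""
    return [[xi ** p for p in range(degree + 1)] for xi in x]

def solve_linear_system(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def fit_polynomial(x, y, degree):
    """Least squares via the normal equations (X^T X) b = X^T y."""
    rows = polynomial_features(x, degree)
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

temp = [0.0, 1.0, 2.0, 3.0, 4.0]     # rescaled temperature (hypothetical)
traffic = [5.0, 4.0, 5.0, 8.0, 13.0]

b0, b1, b2 = fit_polynomial(temp, traffic, degree=2)
print(b0, b1, b2)
```

Higher degrees fit the training data more closely but can overfit, so the degree is best kept as low as the plot of (x,y) pairs suggests.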

Other tips

I am more of an expert on data ETL and combining/aggregating than on the formulas themselves. I work frequently with weather data, and I would like to give some suggestions on using weather data in analysis.

Two types of data are reported in US/Canada:
A. Measurements
B. Weather Type

Weather types (sunny, rainy, severe thunderstorm) fall into two groups: either they are already reflected in the measurements (e.g., sunny, rainy) and are therefore redundant, or they are inclement weather conditions that are not necessarily reflected in the measurements.

For inclement weather types, I would have separate formulae.

For measurements, there are 7 standard daily measurements for Weather Station reporting in North America.

Temp Min/Max
Precipitation
Average Wind Speed
Average Cloudiness (percentage)
Total sunlight (minutes)
Snowfall
Snow Depth

Not all stations report all 7 daily measurements. Some report only Temp and Precipitation. So you may want to have one formula for Temp/Precipitation and an expanded formula for when all seven measurements are available.
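One way to sketch the "two formulas" idea is to pick the feature set per record, depending on which measurements are present. The field names below are hypothetical labels for the measurements listed above (they are not the NOAA element codes), and missing values are assumed to be represented as `None`.

```python
# Choose which formula applies to a daily record, based on which
# measurements the station reported. Field names are hypothetical.

BASIC_FIELDS = ["temp_min", "temp_max", "precipitation"]
EXTENDED_FIELDS = BASIC_FIELDS + [
    "avg_wind_speed",
    "avg_cloudiness",
    "total_sunlight",
    "snowfall",
    "snow_depth",
]

def choose_feature_set(record):
    """Return (model_name, fields) for a dict of daily measurements."""
    def has_all(fields):
        return all(record.get(f) is not None for f in fields)

    if has_all(EXTENDED_FIELDS):
        return "extended", EXTENDED_FIELDS
    if has_all(BASIC_FIELDS):
        return "basic", BASIC_FIELDS
    return "unusable", []

# A station that reports only temperature and precipitation
partial_day = {"temp_min": 3.0, "temp_max": 11.0, "precipitation": 0.4}
model_name, fields = choose_feature_set(partial_day)
print(model_name, fields)
```

Each feature set would then get its own fitted formula, so records from sparsely reporting stations are still usable.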

The two links below document the NOAA/NWS weather terms used in their datasets:

This document is the vocabulary for the annual summaries:

http://www1.ncdc.noaa.gov/pub/data/cdo/documentation/ANNUAL_documentation.pdf

This document is the vocabulary for the daily summaries:

http://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange