Question

Apologies if this question is not in a suitable format. I am a novice in data science.

I have a database of species observation data consisting of ~16 million records. Each record consists of:

  • latitude
  • longitude
  • date
  • time
  • species observed (that's species singular, not plural)

This data has been manually vetted by experts, so there is an additional field for each species in a record that classifies the observation as either valid or invalid (or, more accurately, likely correct/likely incorrect).

I am exploring the idea of training a neural network on this data to automatically classify new records as being valid or invalid ("invalid" data will be flagged for manual expert review.)

The vast majority of records are classified as 'valid', so my worry is that there isn't much information to train the model on what constitutes 'invalid'.

However, a good predictor of whether a record is valid is, informally speaking, "are there other records of this species close by (spatially and/or temporally)?"

I'm not sure where to start with formulating a neural network for this problem. E.g.

Inputs: latitude, longitude, date, time, species

Output: validity

OR

Inputs: latitude, longitude, date, time

Outputs: one output for each known species indicating validity

I like the idea of this second model as I can input a time and location and get out a list of likely species.


So my concrete questions are:

  1. Does this sound like an application suitable for a neural network?

  2. If so, where might I start formulating a model for my problem? Or can someone point me in a good direction to learn more about this topic?


Solution

Before deciding on the model, I would recommend re-formulating the dataset to best suit your problem. You could approach it as follows:

  1. Since the output you're trying to predict is the validity of an observation, keep "validity" (True/False, or 1/0) as the target variable.
  2. One of the inputs is the categorical variable "species", and I expect it to have high cardinality. Since there are approximately 8.7 million species on Earth, using this variable directly could expand it into 8.7 million individual columns (in one-hot encoded form). Even a conservative estimate of 100,000 species makes it nonviable to use as is, so you need a way to convert the species information into fewer features.
  3. One approach you could try is to create geographical clusters for each species (using only the records marked valid), then store each cluster center and the max/avg./quartile distances of that species' valid observations from their cluster center. Do this separately for each quarter of the year to account for seasonal changes, and add this information back to the main dataset. Then, for each record, find the nearest cluster center for its species, calculate that observation's distance from the center, and compute the ratio of this distance to the cluster's maximum and average distances. Use these metrics instead of the raw geospatial coordinates and the species identifier (see the sketch after this list).
  4. Another approach could be to add features such as the climate of each location and the average historical temperature at that location during the time of year when the observation was made. Some animals migrate north/south with the seasons, so even if a species' location is valid in summer, the species may be impossible to find at the same location in winter because it cannot survive the cold. Combining this with #3 above would enrich the observations significantly.
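Below is a minimal sketch of the cluster-distance features from step 3, assuming a pandas DataFrame df with columns "species", "lat", "lon", "date" and a boolean "valid" flag; these column names, and the number of clusters per species, are illustrative assumptions rather than anything from your schema:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def add_cluster_distance_features(df, n_clusters=3):
    """Add per-species cluster-distance features computed from valid records only."""
    df = df.copy()
    df["quarter"] = pd.to_datetime(df["date"]).dt.quarter
    for col in ("dist_to_center", "dist_ratio_max", "dist_ratio_avg"):
        df[col] = np.nan

    # Fit clusters on the valid records, separately per species and quarter.
    for _, group in df.groupby(["species", "quarter"]):
        valid = group[group["valid"]]
        if len(valid) < n_clusters:
            continue  # too few valid records to form clusters for this species/quarter
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        km.fit(valid[["lat", "lon"]])
        centers = km.cluster_centers_

        # Distance of every record in this group to its nearest cluster center.
        coords = group[["lat", "lon"]].to_numpy()
        nearest = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=2).min(axis=1)

        # Reference distances (max/average) computed from the valid records only.
        valid_coords = valid[["lat", "lon"]].to_numpy()
        valid_d = np.linalg.norm(valid_coords[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
        max_d, avg_d = valid_d.max(), valid_d.mean()

        df.loc[group.index, "dist_to_center"] = nearest
        df.loc[group.index, "dist_ratio_max"] = nearest / max_d if max_d > 0 else 0.0
        df.loc[group.index, "dist_ratio_avg"] = nearest / avg_d if avg_d > 0 else 0.0
    return df
```

Note that clustering raw latitude/longitude with Euclidean distance is only a rough proxy for true geographic distance; for more accuracy you could project the coordinates or use haversine distances instead.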

After doing this extensive preparation work, you should do some exploratory analysis and plot subsets of the data to understand it better. Visualizing the data often makes the best course of action clearer, and more quickly, than working without plots.

Next, you can explore different machine-learning algorithms to fit a model to this refined data. I would recommend trying other algorithms such as logistic regression, SVM, ridge regression, random forests and gradient boosting machines in addition to neural networks, and then selecting the best-performing one. Most machine-learning suites/frameworks implement these, so it should not be difficult to apply them to your dataset.
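As a rough illustration of that comparison, here is a sketch assuming you already have a feature matrix X and a binary target y (with 1 marking the invalid records you want to flag); all model settings below are library defaults, not tuned recommendations:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "linear_svm": make_pipeline(StandardScaler(), LinearSVC()),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(),
    "neural_network": make_pipeline(StandardScaler(), MLPClassifier(max_iter=500)),
}

for name, model in models.items():
    # F1 here scores the positive class (1 = invalid), which matters more than raw accuracy.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```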

Neural networks are fine to try out, but as with all algorithms you need to be careful about the usual pitfalls such as:

  1. Avoid over-fitting the model to the training data: use regularization and keep checking accuracy against an independent held-out validation set.
  2. Use cross-validation (e.g. 10-fold) and repeat it several times to get good estimates of the model's performance on new data.
  3. Since the data is highly class-imbalanced (many valid records but proportionally few invalid ones), use a performance metric other than simple accuracy. Try the F1 score, precision (of identifying invalid records), Cohen's kappa, etc.
  4. Because of the high class imbalance, it would also help to over-sample the minority class (invalid), under-sample the majority class (valid), or do both together. This will improve the model's ability to classify the rare class more precisely (see the sketch after this list).
  5. Adjust hyper-parameters such as the learning rate and the number of hidden layers/units for the best model performance.
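To make pitfalls 2-4 concrete, here is an illustrative sketch, again assuming X and y as above (1 = invalid); the use of imbalanced-learn's RandomOverSampler and a balanced class weight are just two of several ways to handle the imbalance:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

model = Pipeline([
    ("oversample", RandomOverSampler(random_state=0)),          # over-sample the minority (invalid) class
    ("clf", RandomForestClassifier(class_weight="balanced")),   # also weight classes in the loss
])

# 10-fold cross-validation, repeated 3 times, preserving class proportions in every fold.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["f1", "precision", "recall", "balanced_accuracy"])

for metric in ("test_f1", "test_precision", "test_recall", "test_balanced_accuracy"):
    print(f"{metric}: {scores[metric].mean():.3f}")
```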

OTHER TIPS

You may or may not end up applying a neural network to your problem. A neural network will certainly produce predictions, but it is hard to say in advance how effective it will be here; you have to implement and test it yourself. Also, since you have all the above data in tabular form and you know the labels (which I gather are valid/invalid), I would suggest trying it with xgboost. Neural networks are exceptional for unsupervised learning, but for supervised learning there may be a model that outperforms a neural network.
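If you want to try that suggestion, a minimal xgboost sketch might look like the following, assuming X and y as before (1 = invalid); scale_pos_weight is set from the class ratio to counteract the imbalance, and nothing else is tuned:

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Ratio of majority (valid) to minority (invalid) examples, used to re-weight the rare class.
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(n_estimators=300, scale_pos_weight=pos_weight, eval_metric="logloss")
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```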

The basic idea of machine learning is this: is there a pattern in the data? If so, then there is a formula that describes the pattern. If the formula is known, use it. If there is a pattern but the formula describing it is unknown, you can use machine learning to determine the formula that best approximates it.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange