How to divide a dataset for training and testing when the features and targets are in two different files?

datascience.stackexchange https://datascience.stackexchange.com/questions/75894

12-12-2020

Question

I am trying to divide a dataset into a training set and a testing set for multi-label classification. The dataset I am working on is this one. It is split across one file that contains the features and another file that contains the targets. They look like the images below.

[Image: a preview of the features file]

[Image: a preview of the targets file]

I intend to use this dataset for multi-label classification, and I am following this tutorial, where the dataset looks like this:

[Image: the dataset format used in the tutorial]

The dataset that I am working on has 17,203,824 samples, and the target file contains 58,255 different and unique labels. So, to follow the tutorial, I intend to create a new 2-D NumPy array with 17,203,824 rows and 58,255 columns, where the appropriate indices will be marked with 1. I am able to create it, but when I try to populate the appropriate indices with 1s, I get an error saying that I don't have enough memory. My code is given below.

import numpy as np
import pandas as pd

questions = pd.read_csv("/kaggle/input/stacklite/questions.csv")
question_tags = pd.read_csv("/kaggle/input/stacklite/question_tags.csv")

# Map each unique tag to a column index
d = {v: i[0] for i, v in np.ndenumerate(question_tags["Tag"].unique())}

# One row per question, one column per unique tag
y = np.zeros([questions.shape[0], len(question_tags["Tag"].unique())], dtype=int)

# For each tag, mark its column with 1 -- this is where memory runs out
for k in question_tags["Tag"]:
    j = d[k]
    for i, l in enumerate(y):
        y[i][j] = 1
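For scale, a quick back-of-the-envelope check using the sample and label counts above shows why memory runs out here:

rows, cols = 17_203_824, 58_255          # samples x unique labels
print(rows * cols * 8 / 1e12, "TB")      # ~8 TB at 8 bytes per int64 entry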

Can anyone please help in telling me how I should proceed?

Solution

I suggest you look at some of the common Python libraries that convert a column of values into labels. Many of these functions have been around for a while and are optimized to use less memory and/or run faster. For example, you can use one-hot encoding via get_dummies() in pandas, or LabelEncoder from scikit-learn.
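As a minimal sketch of both approaches (the toy tag column below stands in for question_tags["Tag"]):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy tag column standing in for question_tags["Tag"]
tags = pd.Series(["python", "pandas", "python", "numpy"], name="Tag")

# One-hot encoding: one 0/1 column per unique tag; sparse=True keeps
# the result compact, which matters with 58,255 unique labels.
one_hot = pd.get_dummies(tags, sparse=True)

# Label encoding: a single integer code per tag instead of a 0/1 matrix.
le = LabelEncoder()
codes = le.fit_transform(tags)   # array([2, 1, 2, 0])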

Here is a good reference with many methods to try, depending on your needs.

https://pbpython.com/categorical-encoding.html

Here is a sample of how word embedding works. Each word in this example is reduced to 2 values (x and y).

[Image: words plotted as 2-D embedding vectors]
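A minimal sketch of the idea (the words and 2-D vectors below are made up for illustration): an embedding is essentially a lookup table mapping each word to a small dense vector.

import numpy as np

vocab = {"king": 0, "queen": 1, "apple": 2}   # made-up vocabulary
embeddings = np.array([
    [0.8, 0.90],   # king
    [0.7, 0.95],   # queen
    [0.1, 0.20],   # apple
])  # in practice these values are learned, not hand-picked

def embed(word):
    # Embedding lookup: each word is reduced to 2 values (x and y)
    return embeddings[vocab[word]]

print(embed("queen"))  # [0.7  0.95]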

OTHER TIPS

You can load the features into a variable (X) and the 2nd column of the targets into another variable (y), then use train_test_split, available in the sklearn library.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=__)  # e.g. test_size=0.2 for an 80/20 split

By the way, you need to preprocess the data first: there are a lot of NaN values, and the id and date columns probably have no value in predicting the target.
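Putting both tips together, here is a minimal sketch; the "Id" merge key and the "CreationDate" column name are assumptions about the two files, and test_size=0.2 is just a common choice:

import pandas as pd
from sklearn.model_selection import train_test_split

questions = pd.read_csv("/kaggle/input/stacklite/questions.csv")
question_tags = pd.read_csv("/kaggle/input/stacklite/question_tags.csv")

# Align features with targets, assuming both files share an "Id" key column.
df = questions.merge(question_tags, on="Id")

# Preprocess: drop rows with NaNs, then drop the id/date columns,
# which carry little predictive signal (column names are assumptions).
df = df.dropna()
X = df.drop(columns=["Id", "CreationDate", "Tag"], errors="ignore")
y = df["Tag"]

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)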

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange