Question

I have two files:

Test_data - contains the features of a dataset to find predictions for

Submission_data - contains two columns: the index column for the test data and another column for the corresponding predicted value

So, I have to make predictions on the test data and store the predicted values in the submission file.

During preprocessing of the test data, I am dropping rows that do not have non-NaN values for at least 50% of the features (columns):

test_data = test_data.dropna(thresh=math.ceil(test_data.shape[1]/2))

Now, how do I remove the corresponding rows in the submissions dataframe? If I drop some rows in the test data, I cannot make a prediction for the corresponding rows in the submissions dataframe/file.

The problem is that there is an Index column that does NOT HAVE UNIQUE values (in both the test data and the submissions data).

So, how do I drop the rows in the submissions data that were also dropped in the test data?

I am new to ML challenges and I find this challenging.


Solution

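When you read the two CSV files and store the data in two dataframes, you could combine them into one dataframe, do the dropna, and then split it back. This assumes the rows of both files are in the same order, since the two dataframes are matched row by row. Here is an example using pandas: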

    import math
    import pandas as pd

    df1 = pd.read_csv('test_data.csv')
    df2 = pd.read_csv('submission_data.csv')

    # Combine the two dataframes side by side (rows are matched by position).
    df3 = pd.concat([df1, df2], axis=1)

    # Keep rows with at least ceil(n_test_columns / 2) non-NaN values.
    # Note: the non-NaN count here also includes the submission columns.
    reduced_data = df3.dropna(thresh=math.ceil(df1.shape[1] / 2))

    # Split back. Instead of 'predictions', use whatever column name you
    # have for the predictions in the submission_data.csv file.
    predictions = reduced_data.loc[:, ['predictions']]
    reduced_data = reduced_data.drop(columns=['predictions'])
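Another option is to skip the concatenation: build a boolean mask from the test data alone and apply it to both dataframes by position. The following is only a minimal sketch, assuming both files have the same number of rows in the same order. The small inline dataframes and the column names 'Index', 'f1', 'f2', 'f3' and 'predictions' are hypothetical stand-ins for the real files.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data.csv and submission_data.csv.
test_data = pd.DataFrame({
    'Index': [1, 1, 2, 3],  # not unique, as in the question
    'f1': [0.5, np.nan, 1.2, np.nan],
    'f2': [1.0, np.nan, np.nan, np.nan],
    'f3': [2.0, np.nan, 3.1, 4.0],
})
submission_data = pd.DataFrame({
    'Index': [1, 1, 2, 3],
    'predictions': [np.nan] * 4,
})

# Same threshold as in the question: minimum number of non-NaN values per row.
thresh = math.ceil(test_data.shape[1] / 2)

# True for the rows that test_data.dropna(thresh=thresh) would keep.
keep = (test_data.notna().sum(axis=1) >= thresh).to_numpy()

# Apply the mask by position, so duplicate Index values do not matter.
test_data = test_data[keep].reset_index(drop=True)
submission_data = submission_data[keep].reset_index(drop=True)
```

Because the mask is computed only from the test data, the submission columns do not affect the non-NaN count, and the non-unique Index column is never used for matching.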

Hope this helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange