Calculate the number of days in each year between two dates in a pandas DataFrame

https://datascience.stackexchange.com/questions/81242

13-12-2020

Question

I'm new to Python and coding. I'm working on a university project exercise. The last question is:

" For each year, compute the total amount of loans. Each loan that has planned expiration time and disburse time in different years must have its amount distributed proportionally to the number of days in each year.

For example, a loan with:

disburse time = 2016/12/01
planned expiration time = 2018/01/30
amount = 5000USD

has an amount of :

5000 * 31 / (31+365+30) = 363.85 for 2016,
5000 * 365 / (31+365+30) = 4284.04 for 2017,
5000 * 30 / (31+365+30) = 352.11 for 2018. "
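
The arithmetic of this example can be reproduced with a few lines of standard-library Python. This is only a sketch for checking the numbers; the variable names are illustrative. Note that the day counts include both the disbursement day and the expiration day, which is how December 2016 contributes 31 days.

```python
from datetime import date

disburse = date(2016, 12, 1)
expiration = date(2018, 1, 30)
amount = 5000

# Inclusive day counts: the +1 counts the first day of each span as well
days = {
    2016: (date(2016, 12, 31) - disburse).days + 1,
    2017: (date(2017, 12, 31) - date(2017, 1, 1)).days + 1,
    2018: (expiration - date(2018, 1, 1)).days + 1,
}
total_days = sum(days.values())

# Share of the amount for each year, rounded to cents
shares = {year: round(amount * n / total_days, 2) for year, n in days.items()}
print(days)
print(shares)
```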

I split the original dataset into two datasets:

one where "planned expiration time" and "disburse time" fall in the same year;
one where "planned expiration time" and "disburse time" fall in different years.

I computed the total loan amount for each year for the first dataset, but I don't understand how to compute it for the second one.

I'm using Jupyter notebook. My code:

loans_cleaned = loans_cleaned.dropna(subset= ['planned_expiration_time', 'disburse_time']) 

loans_cleaned

loans_cleaned['planned_expiration_time'] = loans_cleaned['planned_expiration_time'].dt.tz_localize(None)

loans_cleaned['disburse_time'] = loans_cleaned['disburse_time'].dt.tz_localize(None)

loans_cleaned['planned_expiration_time'] = loans_cleaned['planned_expiration_time'].dt.normalize()

loans_cleaned['disburse_time'] = loans_cleaned['disburse_time'].dt.normalize()

loans_same_year = pd.DataFrame(loans_cleaned[loans_cleaned['planned_expiration_time'].dt.year == loans_cleaned['disburse_time'].dt.year][["loan_id" , "disburse_time", "planned_expiration_time","loan_amount"]])

loans_same_year.reset_index(drop=True, inplace=True)
loans_same_year

    loan_id  disburse_time  planned_expiration_time     loan_amount
0   658010   2014-01-09     2014-02-15                  400.0
1   659347   2014-01-17     2014-02-21                  625.0
2   659605   2014-01-15     2014-02-20                  350.0
3   660240   2014-01-20     2014-02-21                  125.0
4   661601   2014-01-10     2014-02-25                  1600.0
... ... ... ... ...

loans_same_year['year'] = loans_same_year['disburse_time'].dt.year

loans_amount_year = pd.DataFrame(loans_same_year.groupby('year')['loan_amount'].sum().reset_index())

loans_amount_year

    year    loan_amount
0   2012    103911725.0
1   2013    98427750.0
2   2014    120644250.0
3   2015    131208475.0
4   2016    133271575.0
5   2017    144870625.0
6   2018    85300.0

loans_different_year = pd.DataFrame(loans_cleaned[loans_cleaned['planned_expiration_time'].dt.year != loans_cleaned['disburse_time'].dt.year][["loan_id" , "disburse_time", "planned_expiration_time","loan_amount"]])

loans_different_year.reset_index(drop=True, inplace=True)

loans_different_year

How can I compute the number of days in each year for each loan in loans_different_year, and from that the total loan amount for each year? Thanks for your attention.

I tried:

def func(disburse_time, planned_time):
    cost=loans_different_year['loan_amount']
    for year in range(disburse_time.year, planned_time.year+1):
        if year==disburse_time.year:
            dict_map[year] = (datetime.date(year, 12, 31) - disburse_time).days
        elif year==planned_time.year:
            dict_map[year] = (planned_time - datetime.date(year-1, 12, 31)).days
        else:
            if year%4==0:
                dict_map[year]=366
            else:
                dict_map[year]=365
    dict_year_share = {year:cost*days/sum(dict_map.values()) for year,days in dict_map.items()}
    return dict_year_share

a = loans_different_year.apply(lambda x: func(x['disburse_time'], x['planned_expiration_time']), axis=1)
a

TypeError: unsupported type for timedelta days component: Timestamp

on the following line of code:
dict_map[year] = (datetime.date(year, 12, 31) - disburse_time).days

I had already set:

loans_different_year['planned_expiration_time'] = pd.to_datetime(loans_different_year['planned_expiration_time'])

loans_different_year['disburse_time'] = pd.to_datetime(loans_different_year['disburse_time'])

Solution

I'm not sure a one- or two-line solution is possible, but a function along these lines should work:

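import calendar
import datetime

import pandas as pd

def func(disburse_time, planned_time, loan_amount):
    total_cost = loan_amount
    dict_map = {}  # year -> number of loan days that fall in that year
    for year in range(disburse_time.year, planned_time.year + 1):
        if disburse_time.year == planned_time.year:
            # Boundary case: the whole loan lies within a single year
            dict_map[year] = (planned_time - disburse_time).days + 1
        elif year == disburse_time.year:
            # +1 counts the disbursement day itself (2016-12-01 gives 31 days, as in the question)
            dict_map[year] = (pd.to_datetime(datetime.date(year, 12, 31)) - disburse_time).days + 1
        elif year == planned_time.year:
            dict_map[year] = (planned_time - pd.to_datetime(datetime.date(year - 1, 12, 31))).days
        else:
            # Full intermediate year; calendar.isleap also handles century years such as 1900
            dict_map[year] = 366 if calendar.isleap(year) else 365

    # Split the amount proportionally to the number of days in each year
    total_days = sum(dict_map.values())
    dict_year_share = {year: total_cost * days / total_days for year, days in dict_map.items()}
    return dict_year_share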
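
One way to sanity-check the per-year day counts is to enumerate the calendar days directly. The sketch below is not part of the original answer: `split_by_year` is a hypothetical helper, and it assumes that both the disbursement day and the expiration day count as loan days, as in the question's example (31 days in December 2016).

```python
import pandas as pd

def split_by_year(disburse_time, planned_time, loan_amount):
    """Hypothetical helper: split loan_amount across years by inclusive day count."""
    # One entry per calendar day of the loan, both endpoints included
    days = pd.date_range(disburse_time, planned_time, freq="D")
    # Number of loan days that fall in each calendar year
    days_per_year = pd.Series(days.year).value_counts().sort_index()
    # Proportional share of the amount for each year
    return (loan_amount * days_per_year / len(days)).to_dict()

shares = split_by_year(pd.Timestamp("2016-12-01"), pd.Timestamp("2018-01-30"), 5000)
rounded = {year: round(value, 2) for year, value in shares.items()}
print(rounded)
```

Building a daily range for every loan is simple but slow on a large table, so this is better suited to spot checks on a few rows than to the full dataset.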

Call it on the DataFrame rows:

dataset = pd.read_csv('/content/loan.csv')
dataset.disburse_time = pd.to_datetime(dataset.disburse_time,format="%Y-%d-%m")
dataset.planned_expiration_time = pd.to_datetime(dataset.planned_expiration_time,format="%Y-%d-%m")

result = dataset.apply(lambda x: pd.Series(func(x['disburse_time'], x['planned_expiration_time'],x['loan_amount'])), axis=1)
result
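
The question asks for totals per year, while `result` holds one row per loan and one column per year, with NaN where a loan does not touch a given year. A minimal sketch of the aggregation step follows. The two frames are small made-up stand-ins for `result` and for the same-year totals, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `result`: one row per loan, one column per year,
# NaN where a loan does not cover that year
result = pd.DataFrame({
    2016: [363.85, np.nan],
    2017: [4284.04, 1000.0],
    2018: [352.11, 2000.0],
})

# Hypothetical stand-in for the totals of loans that start and end in the same year
same_year_totals = pd.Series({2016: 500.0, 2017: 250.0})

# Column-wise sum; NaN values are skipped by default
different_year_totals = result.sum()

# fill_value=0 keeps years that appear in only one of the two series
total_per_year = different_year_totals.add(same_year_totals, fill_value=0)
print(total_per_year)
```

With the question's data, the same-year series could presumably be obtained from `loans_amount_year.set_index('year')['loan_amount']`.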

This is the dataset I used (dates are in year-day-month order):

"disburse_time","planned_expiration_time","loan_amount"
"1959-01-01","1960-30-11",35000
"1959-01-02","1962-31-08",32000
"1959-01-03","1965-31-05",30000
"1959-01-11","1959-30-11",31000
"1959-01-10","1961-31-03",44000

Boundary scenarios should also be checked, for example loans that start or end on 31 December or 1 January.
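
A few such boundary checks can be written as plain assertions. The sketch below uses a hypothetical helper, `days_per_year`, that clips the loan period to each calendar year. It works on `datetime.date` objects (a pandas Timestamp can be converted with `.date()`) and assumes both endpoints count as loan days.

```python
from datetime import date

def days_per_year(start, end):
    """Hypothetical helper: inclusive day count per calendar year between two dates."""
    counts = {}
    for year in range(start.year, end.year + 1):
        # Clip the loan period to the current calendar year
        first = max(start, date(year, 1, 1))
        last = min(end, date(year, 12, 31))
        counts[year] = (last - first).days + 1
    return counts

# Loan that starts and ends in the same year
assert days_per_year(date(2014, 1, 9), date(2014, 2, 15)) == {2014: 38}
# Loan that starts on 31 December and ends on 1 January, spanning a leap year
assert days_per_year(date(2015, 12, 31), date(2017, 1, 1)) == {2015: 1, 2016: 366, 2017: 1}
# 1900 is divisible by 4 but is not a leap year
assert days_per_year(date(1899, 12, 31), date(1901, 1, 1))[1900] == 365
# Single-day loan on a leap day
assert days_per_year(date(2020, 2, 29), date(2020, 2, 29)) == {2020: 1}
```

The 1900 case is the one a plain `year % 4 == 0` test gets wrong, although it only matters for data that spans a century year that is not divisible by 400.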

Licensed under: CC-BY-SA with attribution
