Calculate the number of days in each year between two dates in a pandas DataFrame
Question
I'm new to Python and to coding. I'm working on a university project exercise. The last question is:
" For each year, compute the total amount of loans. Each loan that has planned expiration time and disburse time in different years must have its amount distributed proportionally to the number of days in each year.
For example, a loan with:
- disburse time = 2016/12/01
- planned expiration time = 2018/01/30
- amount = 5000USD
has an amount of :
- 5000 * 31 / (31+365+30) = 363.85 for 2016,
- 5000 * 365 / (31+365+30) = 4284.04 for 2017,
- 5000 * 30 / (31+365+30) = 352.11 for 2018. "
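The day counts in this example can be reproduced with pandas. This is only a minimal check of the arithmetic, assuming both the disburse date and the expiration date are counted (which is what gives 31 and 30 days):

```python
import pandas as pd

disburse = pd.Timestamp("2016-12-01")
expiration = pd.Timestamp("2018-01-30")
amount = 5000

# Days per calendar year, counting both end dates (inclusive).
days = {
    2016: (pd.Timestamp("2016-12-31") - disburse).days + 1,    # Dec 1 .. Dec 31
    2017: 365,                                                 # full non-leap year
    2018: (expiration - pd.Timestamp("2018-01-01")).days + 1,  # Jan 1 .. Jan 30
}
total_days = sum(days.values())  # 31 + 365 + 30 = 426

shares = {year: round(amount * n / total_days, 2) for year, n in days.items()}
print(shares)  # {2016: 363.85, 2017: 4284.04, 2018: 352.11}
```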
I split the original dataset into two datasets:
- one where "planned expiration time" and "disburse time" fall in the same year;
- one where "planned expiration time" and "disburse time" fall in different years.
I computed the total amount of loans per year for the first dataset, but I don't understand how to compute it for the second one.
I'm using a Jupyter notebook. My code:
loans_cleaned = loans_cleaned.dropna(subset= ['planned_expiration_time', 'disburse_time'])
loans_cleaned
loans_cleaned['planned_expiration_time'] = loans_cleaned['planned_expiration_time'].dt.tz_localize(None)
loans_cleaned['disburse_time'] = loans_cleaned['disburse_time'].dt.tz_localize(None)
loans_cleaned['planned_expiration_time'] = loans_cleaned['planned_expiration_time'].dt.normalize()
loans_cleaned['disburse_time'] = loans_cleaned['disburse_time'].dt.normalize()
loans_same_year = pd.DataFrame(loans_cleaned[loans_cleaned['planned_expiration_time'].dt.year == loans_cleaned['disburse_time'].dt.year][["loan_id" , "disburse_time", "planned_expiration_time","loan_amount"]])
loans_same_year.reset_index(drop=True, inplace=True)
loans_same_year
loan_id disburse_time planned_expiration_time loan_amount
0 658010 2014-01-09 2014-02-15 400.0
1 659347 2014-01-17 2014-02-21 625.0
2 659605 2014-01-15 2014-02-20 350.0
3 660240 2014-01-20 2014-02-21 125.0
4 661601 2014-01-10 2014-02-25 1600.0
... ... ... ... ...
loans_same_year['year'] = loans_same_year['disburse_time'].dt.year
loans_amount_year = pd.DataFrame(loans_same_year.groupby('year')['loan_amount'].sum().reset_index())
loans_amount_year
year loan_amount
0 2012 103911725.0
1 2013 98427750.0
2 2014 120644250.0
3 2015 131208475.0
4 2016 133271575.0
5 2017 144870625.0
6 2018 85300.0
loans_different_year = pd.DataFrame(loans_cleaned[loans_cleaned['planned_expiration_time'].dt.year != loans_cleaned['disburse_time'].dt.year][["loan_id" , "disburse_time", "planned_expiration_time","loan_amount"]])
loans_different_year.reset_index(drop=True, inplace=True)
loans_different_year
How can I compute the number of days in each year for each loan in loans_different_year, and then the total amount of loans for each year? Thanks for your attention.
I tried:
def func(disburse_time, planned_time):
    cost=loans_different_year['loan_amount']
    for year in range(disburse_time.year, planned_time.year+1):
        if year==disburse_time.year:
            dict_map[year] = (datetime.date(year, 12, 31) - disburse_time).days
        elif year==planned_time.year:
            dict_map[year] = (planned_time - datetime.date(year-1, 12, 31)).days
        else:
            if year%4==0:
                dict_map[year]=366
            else:
                dict_map[year]=365
    dict_year_share = {year:cost*days/sum(dict_map.values()) for year,days in dict_map.items()}
    return dict_year_share
a = loans_different_year.apply(lambda x: func(x['disburse_time'], x['planned_expiration_time']), axis=1)
a
TypeError: unsupported type for timedelta days component: Timestamp
on the following line of code:
dict_map[year] = (datetime.date(year, 12, 31) - disburse_time).days
I had already set:
loans_different_year['planned_expiration_time'] = pd.to_datetime(loans_different_year['planned_expiration_time'])
loans_different_year['disburse_time'] = pd.to_datetime(loans_different_year['disburse_time'])
Solution
I'm not sure a one- or two-liner is possible, but a function along these lines should work:
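import calendar
import datetime
import pandas as pd

def func(disburse_time, planned_time, loan_amount):
    total_cost = loan_amount
    dict_map = {}  # year -> number of loan days falling in that year
    for year in range(disburse_time.year, planned_time.year + 1):
        if disburse_time.year == planned_time.year:
            # Loan starts and ends in the same year: count both end dates.
            dict_map[year] = (planned_time - disburse_time).days + 1
        elif year == disburse_time.year:
            # First year: disburse date through Dec 31; +1 counts the disburse date itself.
            dict_map[year] = (pd.to_datetime(datetime.date(year, 12, 31)) - disburse_time).days + 1
        elif year == planned_time.year:
            # Last year: Jan 1 through the planned expiration date, inclusive.
            dict_map[year] = (planned_time - pd.to_datetime(datetime.date(year - 1, 12, 31))).days
        else:
            # Full year in between; calendar.isleap also handles century years.
            dict_map[year] = 366 if calendar.isleap(year) else 365
    total_days = sum(dict_map.values())
    dict_year_share = {year: total_cost * days / total_days for year, days in dict_map.items()}
    return dict_year_share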
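The function wraps each `datetime.date` in `pd.to_datetime` because the DataFrame columns hold pandas `Timestamp` values, while the attempt in the question subtracted a `Timestamp` from a plain `datetime.date`. A minimal sketch of the working subtraction, assuming `disburse_time` is a `Timestamp` as it would be after `pd.to_datetime` on the column:

```python
import datetime
import pandas as pd

disburse_time = pd.Timestamp("2016-12-01")  # what a datetime64 column yields per row

# Convert the plain date to a Timestamp first, so both operands have the same type.
year_end = pd.to_datetime(datetime.date(2016, 12, 31))
delta = year_end - disburse_time  # a pandas Timedelta
print(delta.days)                 # 30 elapsed days; the start date is not included

# Equivalent construction without the datetime module.
year_end_alt = pd.Timestamp(year=2016, month=12, day=31)
print(year_end_alt == year_end)   # True
```

Note that the difference is 30, not 31: add 1 if the disburse date itself should count, as in the question's example (31 days for 2016).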
Calling it on the DataFrame rows:
dataset = pd.read_csv('/content/loan.csv')
dataset.disburse_time = pd.to_datetime(dataset.disburse_time,format="%Y-%d-%m")
dataset.planned_expiration_time = pd.to_datetime(dataset.planned_expiration_time,format="%Y-%d-%m")
result = dataset.apply(lambda x: pd.Series(func(x['disburse_time'], x['planned_expiration_time'],x['loan_amount'])), axis=1)
result
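`result` should have one row per loan and one column per year, with NaN where a loan does not touch a year. To get the per-year totals the question asks for, summing the columns is probably enough. A minimal sketch, using a small hand-made frame as a stand-in for `result` (the values are illustrative, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Stand-in for `result`: rows are loans, columns are years, NaN = loan not active that year.
result = pd.DataFrame(
    {
        2016: [100.0, np.nan],
        2017: [300.0, 250.0],
        2018: [np.nan, 750.0],
    }
)

# The column-wise sum skips NaN by default, giving one total per year.
yearly_totals = (
    result.sum()
    .rename("loan_amount")
    .rename_axis("year")
    .reset_index()
)
print(yearly_totals)
```

The resulting frame has the same `year` / `loan_amount` layout as `loans_amount_year` in the question, so the two can be concatenated and grouped by `year` once more to get the overall totals.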
This is my dataset (dates are in year-day-month order, hence format="%Y-%d-%m"):
"disburse_time","planned_expiration_time","loan_amount"
"1959-01-01","1960-30-11",35000
"1959-01-02","1962-31-08",32000
"1959-01-03","1965-31-05",30000
"1959-01-11","1959-30-11",31000
"1959-01-10","1961-31-03",44000
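Boundary scenarios still need to be checked, for example loans that start and end in the same year, leap years, and whether both the disburse date and the expiration date are counted.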
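One way to cover these boundary scenarios is to clip the loan period to each calendar year instead of special-casing the first, last, and middle years. This is a sketch of an alternative, not the function above: `split_by_year` is a hypothetical name, and it assumes tz-naive, normalized `Timestamp` inputs with both end dates counted, as in the question's example.

```python
import pandas as pd

def split_by_year(disburse_time, planned_time, loan_amount):
    """Hypothetical helper: share of loan_amount per calendar year, end dates inclusive."""
    days = {}
    for year in range(disburse_time.year, planned_time.year + 1):
        # Clip the loan period to this calendar year.
        start = max(disburse_time, pd.Timestamp(year=year, month=1, day=1))
        end = min(planned_time, pd.Timestamp(year=year, month=12, day=31))
        days[year] = (end - start).days + 1
    total_days = sum(days.values())
    return {year: loan_amount * n / total_days for year, n in days.items()}

# Same-year loan: the whole amount stays in one year.
same = split_by_year(pd.Timestamp("2014-01-09"), pd.Timestamp("2014-02-15"), 400.0)
assert same == {2014: 400.0}

# Leap year in the middle: 2016 contributes 366 days without a special case.
leap = split_by_year(pd.Timestamp("2015-12-31"), pd.Timestamp("2017-01-01"), 368.0)
assert leap == {2015: 1.0, 2016: 366.0, 2017: 1.0}

# The example from the question.
ex = split_by_year(pd.Timestamp("2016-12-01"), pd.Timestamp("2018-01-30"), 5000)
assert [round(v, 2) for v in ex.values()] == [363.85, 4284.04, 352.11]
```

Because the clipping handles every year the same way, the same-year case and leap years need no extra branches.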