How can I find the starting point of skewed data in Python?
09-12-2020
Question
I have a list like this:
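import random
import seaborn as sns

years = []
for i in range(1000):
    # Every 100th value comes from the sparse 1900-2000 range.
    if i % 100 == 0:
        val = random.randint(1900, 2000)
    else:
        val = random.randint(2000, 2021)
    years.append(val)

# distplot is deprecated since seaborn 0.11; sns.histplot(years, kde=True) is the replacement.
sns.distplot(years);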
As you can see, most of the density lies after 2000, and there is very little data before that point. How can I find this point in skewed data? Is there a formula that gives it? Any ideas? Thanks in advance.
Solution
Depending on how rigorous you need to be, I would suggest starting by simply removing the data with the lowest counts:
- Bin your data (equivalent to what you did by plotting the histogram)
- Count the values in each bin
- Look at the distribution of those counts
- Remove the bins with the lowest counts
- Take the cutoff as the minimum of what remains
- Repeat with different bin sizes
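The steps above can be sketched roughly as follows, using only the standard library. The helper name `estimate_cutoff`, the seed, and the `min_fraction` threshold (a bin counts as "low" when it holds less than 20% of the fullest bin's count) are assumptions made for illustration, not part of the original answer.

```python
import random
from collections import Counter

# Recreate data shaped like the question's (seeded so the run is repeatable).
rng = random.Random(0)
years = [
    rng.randint(1900, 2000) if i % 100 == 0 else rng.randint(2000, 2021)
    for i in range(1000)
]


def estimate_cutoff(values, bin_width=5, min_fraction=0.2):
    """Return the left edge of the lowest bin whose count is not 'low'.

    bin_width and min_fraction are illustrative assumptions: a bin is kept
    when its count is at least min_fraction of the fullest bin's count.
    """
    # Bin the data and count the values in each bin (keyed by left edge).
    counts = Counter((v // bin_width) * bin_width for v in values)
    # Drop bins whose count is small relative to the largest bin.
    threshold = min_fraction * max(counts.values())
    kept = [edge for edge, n in counts.items() if n >= threshold]
    # The cutoff is the smallest bin edge that survives.
    return min(kept)


# Try several bin sizes and compare the answers.
for width in (1, 2, 5, 10):
    print(width, estimate_cutoff(years, bin_width=width))
```

If the estimates agree across bin widths, that is some evidence the cutoff is not an artifact of the binning. The threshold is the main tuning knob and would need adjusting for data with a less abrupt transition.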
That should be enough to get the value.
After that, you may want to make some assumptions about the underlying process and run a statistical test on the data before and after the cutoff to see whether the difference is significant.
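One possible way to do such a check, offered as a sketch rather than the answer's prescribed method, is a permutation test on per-year counts. The assumption here is that, if there were no real change at the cutoff, the per-year counts before and after it would be exchangeable. The function name `permutation_p_value`, the cutoff of 2000, and the seeds are hypothetical choices for illustration.

```python
import random
from collections import Counter

# Recreate data shaped like the question's (seeded so the run is repeatable).
rng = random.Random(0)
years = [
    rng.randint(1900, 2000) if i % 100 == 0 else rng.randint(2000, 2021)
    for i in range(1000)
]


def permutation_p_value(values, cutoff, n_perm=2000, seed=1):
    """One-sided permutation test: are per-year counts higher after cutoff?"""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    # Per-year counts on each side of the cutoff (years with no data count 0).
    before = [counts[y] for y in range(lo, cutoff)]
    after = [counts[y] for y in range(cutoff, hi + 1)]
    n_before = len(before)

    def mean_diff(pool):
        left, right = pool[:n_before], pool[n_before:]
        return sum(right) / len(right) - sum(left) / len(left)

    observed = mean_diff(before + after)
    pooled = before + after
    shuffler = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Reassign counts to years at random and recompute the difference.
        shuffler.shuffle(pooled)
        if mean_diff(pooled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


p = permutation_p_value(years, cutoff=2000)
print(f"p-value: {p:.4f}")
```

A small p-value suggests the jump in density at the cutoff is unlikely under the exchangeability assumption. The function expects the cutoff to lie strictly inside the data range, so that both sides contain at least one year.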
OTHER TIPS
Try outlier-detection methods such as Tukey's fences or the modified Thompson tau test.
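As a minimal sketch of the first suggestion, Tukey's lower fence is Q1 - k * IQR with the conventional k = 1.5. The helper name `tukey_lower_fence` and the seeded sample data are assumptions for illustration; `statistics.quantiles` requires Python 3.8 or later.

```python
import random
from statistics import quantiles

# Recreate data shaped like the question's (seeded so the run is repeatable).
rng = random.Random(0)
years = [
    rng.randint(1900, 2000) if i % 100 == 0 else rng.randint(2000, 2021)
    for i in range(1000)
]


def tukey_lower_fence(values, k=1.5):
    """Return Q1 - k * IQR; values below this are flagged as low outliers."""
    q1, _median, q3 = quantiles(values, n=4)
    return q1 - k * (q3 - q1)


fence = tukey_lower_fence(years)
# Values below the fence are treated as the sparse early tail.
outliers = sorted(v for v in years if v < fence)
# The smallest value at or above the fence serves as the estimated start.
cutoff = min(v for v in years if v >= fence)
print(f"lower fence: {fence}, outliers: {outliers}, cutoff: {cutoff}")
```

Note that the fence depends on the spread of the dense region, so it may fall somewhat below the point where the density visibly starts. Sparse values lying between the fence and that point would not be flagged.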
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange