Recently, I was asked to complete a linear regression task. I was given two .csv files, merged the two files, and cleaned up the data points using pandas. The dataset has the following information:
1) the country
2)the year
3)the dependent variable provided to me is in percent format, representing the percent of total GDP accrued through taxes
4)the independent variable provided is also in percent format, representing the share of population considered ‘urban’
I was asked to do simple linear regression on the two variables, so I attempted to find a direct linear relationship between the unaltered data, and due to the nature of the data, ran into some issues.
Neither of my assigned variables are continuous data, and as I do not have the continuous information such as total GDP, or total population, this information is not enough to find a linear relationship including a constant that makes sense for the almost 5000 data points.
I attempted to groupby the country and scale the data that way, since the percentages are in fact, per country and not percentages of total GDP for the global economy. This did not work either, and my results continued to be poor on the training data. I log transformed the dependent variable, which did not normalize my data, nor did it provide better results, if anything, these were worse.
After getting R-squared results below 0.2 consistently, I decided to add the categorical data, the country value as categorical, dummy encoded data. This gave me an R-squared value over 0.8, showing much better results.
Basically, the linear regression problem in question can not be completed without the total GDP information. Additional data is needed to complete a task such as this, with percentage data as the only variables offered.