1. Post Secondary Education

In the Singapore education system, a child usually attends nursery at age 4, kindergarten at ages 5 to 6, primary school from ages 7 to 12, and secondary school from age 13 to 16 or 17, depending on their stream. After that, they either proceed to post-secondary education at a Polytechnic, Junior College (JC), or Institute of Technical Education (ITE), or start working.

The data set, which I obtained from Data.gov.sg (Singapore's open data portal), contains primary school cohort data with the following fields:

- Year: The year the students were in Primary 1

- Race: The race of the students

- Percentage of P1 Cohort: The percentage of the cohort that went on to post-secondary education.

This short analysis uses regression to predict the percentage of a cohort entering post-secondary education.

We first import the data into Python. The data has 120 rows and 3 columns. Each year has 5 rows: 4 with data for an individual race and 1 with the overall figure (Race='All'). I will split the data into 2 groups, one with only Race='All' and one without it, and then analyse each group separately.
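The split described above can be sketched as follows. This is only a minimal sketch using the standard library, not the code used for the analysis: the column names `year`, `race`, and `percentage` are my placeholders and may differ from the headers in the actual file.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def split_by_race(rows, race_column="race", overall_label="All"):
    """Return (overall_rows, per_race_rows).

    overall_rows keeps only the rows whose race equals overall_label;
    per_race_rows keeps every other row.
    The column name and label are assumptions about the file layout.
    """
    overall = [r for r in rows if r[race_column].strip() == overall_label]
    per_race = [r for r in rows if r[race_column].strip() != overall_label]
    return overall, per_race
```

With 120 rows and 5 rows per year, one would expect 24 rows in the first group and 96 in the second, before any missing values are removed.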

Race='All'

This group has one row per year in which the cohort was in Primary 1, with no race variable involved. I will first plot a graph to show the relationship between the year and the percentage of the P1 cohort going on to post-secondary education.








From the graph, we can clearly see that the percentage of the cohort going on to post-secondary education rises over the years. This could be due to improvements in technology as well as increasing competition in the job market: parents may worry about their children's prospects if they do not hold a high qualification, so the attitude of parents towards education in 2018 is likely very different from that in 1995.

Next, I will perform a regression on these 2 variables.
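For readers who want to see what a one-variable regression computes, here is a minimal sketch of ordinary least squares written from the closed-form formulas. It is not the code behind the results in this post, and the function name `fit_simple_ols` is my own. A statistics library such as statsmodels would also report the p-values discussed later, which this sketch does not.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y = b0 + b1 * x.

    Returns (b0, b1, r_squared).
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x and cross-product of x and y around their means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return b0, b1, r_squared
```

Here `xs` would be the years and `ys` the percentages from the Race='All' group.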














We obtain the fitted equation

y = -2226.76844 + 1.1539 x_1

where x_1 is the year and y is the predicted percentage of the cohort.

The p-values of both the constant and the year variable are less than 0.05, so both are statistically significant. However, the formula has an obvious problem: for any x_1 of 2017 or later, y exceeds 100, which is not possible for a percentage, so such predictions would have to be capped at 100%. The R-squared and adjusted R-squared are both acceptable, so the issue tells us that year alone is not enough to predict the percentage of the cohort entering post-secondary education. We need more variables.
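The point at which the fitted line passes 100% can be checked directly from the two coefficients. The sketch below uses the rounded coefficients quoted in the equation, so the result is only as precise as that rounding. The function names and the idea of clipping to the 0 to 100 range are my own illustration.

```python
import math

INTERCEPT = -2226.76844  # constant from the fitted equation
SLOPE = 1.1539           # coefficient on year (x_1)


def predict_percentage(year):
    """Raw prediction from the fitted line, with no capping."""
    return INTERCEPT + SLOPE * year


def predict_capped(year):
    """Prediction clipped to the valid range of 0 to 100 percent."""
    return max(0.0, min(100.0, predict_percentage(year)))


def first_year_above_100():
    """Smallest whole year whose raw prediction is strictly above 100."""
    crossing = (100.0 - INTERCEPT) / SLOPE  # year at which the line equals 100
    return math.floor(crossing) + 1
```

Solving 100 = -2226.76844 + 1.1539 x_1 gives x_1 of about 2016.4, so the raw prediction is just under 100 for 2016 and about 100.6 for 2017.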

Excluding Race='All'

Next, we perform a regression on the other data set, which includes the race of the cohort. Some values of the percentage of cohort are missing. Since that is the output variable, replacing missing values with the median or mean would distort the result, so I will remove the rows where the output is NA.
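Dropping the rows with a missing output could look like the sketch below. The column name `percentage` and the set of strings treated as missing are assumptions on my part, since how the file marks missing values is not shown here.

```python
def drop_missing_output(rows, column="percentage",
                        missing_markers=("", "na", "nan")):
    """Keep only the rows whose output value is present.

    rows is a list of dicts; a value counts as missing when it is None
    or, ignoring case and surrounding spaces, one of missing_markers.
    """
    kept = []
    for row in rows:
        value = row.get(column)
        if value is None or str(value).strip().lower() in missing_markers:
            continue  # skip rows with no usable output
        kept.append(row)
    return kept
```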

I will go directly to the regression model, since we have already established the relationship between year and the percentage of the cohort going on to post-secondary education. Since race is a categorical variable, it is replaced with dummy variables.
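As an illustration of dummy coding, the sketch below turns a race label into three 0/1 indicators, with Chinese as the baseline category that gets all zeros. The exact label strings are assumptions about how the data set spells them, and `encode_race` is a hypothetical helper name.

```python
BASELINE = "Chinese"                         # baseline: all indicators are 0
DUMMY_RACES = ("Indian", "Malay", "Others")  # one indicator per category


def encode_race(race):
    """Map a race label to a tuple of 0/1 indicators (Indian, Malay, Others)."""
    if race != BASELINE and race not in DUMMY_RACES:
        raise ValueError(f"unexpected race label: {race!r}")
    return tuple(int(race == r) for r in DUMMY_RACES)
```

One category has to be left out as the baseline. Otherwise the indicators would always sum to 1 and duplicate the constant term.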



























We obtain the fitted equation

y = -2855.0770 + 1.4685 x_1 - 9.5083 x_21 - 12.0542 x_22 - 7.2690 x_23

where x_1 = year, x_21 = 1 if the race is Indian, x_22 = 1 if the race is Malay, x_23 = 1 if the race is Others, and x_21 = x_22 = x_23 = 0 if the race is Chinese.

The p-values of the constant and of all the variables are less than the alpha value of 0.05, and the R-squared and adjusted R-squared are also very high. This model should be good to use, and if so, it shows that both year and race affect the percentage of the cohort that went on to post-secondary education.

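The limitation of this model is that predictions can exceed 100%: for the Chinese baseline group this happens for any year after 2012, and for the other groups a few years later, since their coefficients are negative.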
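The same check as for the first model can be done per race. The sketch below plugs the rounded coefficients from the equation above into a small prediction function. The dictionary layout and function names are my own, and the results are only as precise as the rounded coefficients.

```python
import math

INTERCEPT = -2855.0770
YEAR_COEF = 1.4685
# Offset added for each race; Chinese is the baseline, so its offset is 0.
RACE_COEF = {
    "Chinese": 0.0,
    "Indian": -9.5083,
    "Malay": -12.0542,
    "Others": -7.2690,
}


def predict_percentage(year, race):
    """Raw prediction from the fitted equation, with no capping."""
    return INTERCEPT + YEAR_COEF * year + RACE_COEF[race]


def first_year_above_100(race):
    """Smallest whole year whose raw prediction is strictly above 100."""
    crossing = (100.0 - INTERCEPT - RACE_COEF[race]) / YEAR_COEF
    return math.floor(crossing) + 1
```

For the Chinese group the line reaches 100 at about 2012.3, so 2013 is the first year with a prediction above 100%. The negative offsets of the other groups push their crossing points a few years later.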

Assumptions of Linear Regression Model



Linearity – The model is specified as linear in year and the race dummy variables, so linearity is assumed.

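No omitted-variable bias – The residuals sum to 0.0000000000699, which is effectively zero. Note that a least-squares fit with a constant term forces the residuals to sum to zero, so this check alone cannot rule out omitted variables.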



Normality of errors – The residuals are taken to be normally distributed, and they also have constant variance (homoscedasticity).

Autocorrelation – The Durbin-Watson statistic is 1.031. It lies between 1 and 3, but it is well below 2, the value expected when there is no autocorrelation, so some positive autocorrelation appears to be present.

Multicollinearity – Assumed not present, since year and race are different kinds of variable (one numeric, one categorical).
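For reference, the Durbin-Watson statistic used in the autocorrelation check is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes it from a list of residuals ordered in time. It is a plain-Python illustration, not the routine that produced the 1.031 value.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered in time.

    Values near 2 suggest no first-order autocorrelation, values towards 0
    suggest positive autocorrelation, and values towards 4 suggest negative
    autocorrelation.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    return numerator / denominator
```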

Limitations

Based on the assumptions and the analysis, the model is limited. More variables need to be included for a more accurate result, such as each student's academic scores, their co-curricular activities, the stream they were in at secondary school, and more. Determining the output from race and year alone is very limiting, and it causes predictions for future years to exceed 100%, which makes the model unusable for those years.

Knowing the sample size of each race would also help to increase accuracy. Many variables are missing from the model, so despite the high R-squared value, the model is not deemed useful.

No solid conclusion can be made from the model.

 

 

