In this project I will be helping the State's Board of Education analyze data on student funding and SAT scores. The goal is to uncover trends in students' performance and funding in order to reallocate the budget appropriately.
- Data sources:
  - schools_complete.csv
  - students_complete.csv
- Software:
  - Python 3.6.1
  - Jupyter Notebook
  - pandas and NumPy libraries
The first thing I needed to do was clean the data. This process consisted of two steps. First, I noticed that some of the student names included professional prefixes and suffixes, so I removed them using the str.replace() method as follows:
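# Professional prefixes and suffixes to strip from student names.
prefixes_suffixes = ["Dr. ", "Mr. ", "Ms. ", "Mrs. ", "Miss ", " MD", " DDS", " DVM", " PhD"]
for word in prefixes_suffixes:
    # regex=False treats each entry as literal text, so "." is not a wildcard.
    student_data_df["student_name"] = student_data_df["student_name"].str.replace(word, "", regex=False)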
Second, I noticed that the data showed evidence of academic dishonesty in the results of 9th graders at Thomas High School. Thus, I replaced their math and reading scores with NaNs while keeping the rest of the data intact. I will address later how this change affected the overall analysis.
# Select 9th graders at Thomas High School, then blank out both score columns.
thomas_9th = (student_data_df["grade"] == "9th") & (student_data_df["school_name"] == "Thomas High School")
student_data_df.loc[thomas_9th, ["reading_score", "math_score"]] = np.nan
Before going into the actual analysis of the data with the NaNs for 9th graders at Thomas H.S., I expected slight but not significant changes in the results. NaN (Not a Number) values can be tricky. They are not errors but markers for missing data: element-wise arithmetic on a NaN returns NaN, while pandas aggregations such as sums and means skip NaNs by default. In this case, I knew they would not affect my approach, because they would simply be ignored when calculating the average math and reading scores for each grade and school. Again, I expected no major differences, but we will see what happened in the results section.
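To make the NaN behaviour concrete, here is a minimal sketch on made-up scores (not the project data). The variable names are illustrative assumptions; only standard pandas and NumPy calls are used.

```python
import numpy as np
import pandas as pd

# Three made-up scores, one of them missing.
scores = pd.Series([80.0, 90.0, np.nan])

# Element-wise arithmetic propagates NaN: the third result is still NaN.
doubled = scores * 2

# Aggregations skip NaN by default (skipna=True), so the mean uses only
# the two recorded scores.
mean_score = scores.mean()

# count() reports only non-missing values, while len() counts every row.
valid_count = scores.count()
total_rows = len(scores)

# Comparisons against NaN evaluate to False, so a NaN score never counts as passing.
passing = scores >= 70

print(doubled.tolist())
print(mean_score, valid_count, total_rows)
print(passing.tolist())
```

The difference between count() and len() matters whenever a percentage is computed, since it decides whether students with missing scores stay in the denominator.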
- How is the district summary affected?
As we can observe, there is a slight difference in the results for the district summary. Once I replaced the Thomas High School 9th graders' scores with NaNs, the average scores, the passing percentages, and the overall passing percentage went down by no more than 0.3. In conclusion, because the district summary is such a broad analysis, it was not affected significantly: the replaced scores are just a small fraction of the data for all 15 schools.
- How is the school summary affected?
As can be seen, replacing the scores with NaNs generated a significant increase in the passing percentages for math, reading, and overall. This was because a considerable number of 9th graders at Thomas High School did not pass the math or reading exams. Even with the academic dishonesty, the 9th graders kept the percentage of passing students very low, in the 60s. After their scores were replaced with NaNs, and therefore excluded from the analysis, the percentage of passing students rose into the 90s, as the rest of the school mostly scored above the passing grade of 70.
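The effect described here can be sketched on a toy example. The scores, the `toy_df` frame, the `passing_pct` helper, and the "70 or higher" passing rule are all assumptions made for illustration; this is not the notebook's actual calculation.

```python
import numpy as np
import pandas as pd

# Made-up math scores for one hypothetical school (not the project data).
toy_df = pd.DataFrame({
    "grade": ["9th", "9th", "10th", "11th", "12th", "12th"],
    "math_score": [60.0, 65.0, 85.0, 90.0, 75.0, 68.0],
})

PASSING_SCORE = 70  # assumption: a score of 70 or higher counts as passing


def passing_pct(scores):
    """Percentage of recorded (non-NaN) scores at or above PASSING_SCORE."""
    valid = scores.dropna()
    return (valid >= PASSING_SCORE).mean() * 100


before = passing_pct(toy_df["math_score"])  # 3 of 6 students pass

# Replace the 9th graders' scores with NaN, as done for Thomas High School.
toy_df.loc[toy_df["grade"] == "9th", "math_score"] = np.nan
after = passing_pct(toy_df["math_score"])  # 3 of the 4 remaining scores pass

# Dividing by every row instead keeps the NaN students in the denominator.
after_all_rows = (toy_df["math_score"] >= PASSING_SCORE).sum() / len(toy_df) * 100

print(before, after, after_all_rows)
```

The last line of the calculation shows why the denominator matters: if the students with NaN scores are still counted, the percentage does not rise even though their scores were removed.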
- How does replacing the ninth graders’ math and reading scores affect Thomas High School’s performance relative to the other schools?
Compared to other schools, Thomas High School's results became much better after its 9th graders' scores were replaced with NaNs. The average scores were still quite similar, but its passing percentages were among the highest. These are the top 5 schools of the district:
As we can see above, Thomas High School reached 2nd place out of the 15 schools in the district with an overall passing percentage of 90.63%. This means that 90.63% of the school's students (excluding 9th graders because of academic dishonesty) scored above 70 in both math and reading.
- Math and reading scores by grade
First we have the average math scores per grade, followed by the average reading scores per grade, for each school. In this case, the NaNs did not affect the results at all; they simply appear as NaN for ninth graders at Thomas High School. The rest of the data remained intact.
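A minimal sketch of why only one cell shows NaN, using made-up rows. "Other High School" and all scores are hypothetical, and the groupby/unstack layout is one possible way to build such a table, not necessarily the one used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows: every 9th-grade score at Thomas High School is missing.
toy_df = pd.DataFrame({
    "school_name": ["Thomas High School"] * 4 + ["Other High School"] * 2,
    "grade": ["9th", "9th", "10th", "10th", "9th", "10th"],
    "math_score": [np.nan, np.nan, 80.0, 90.0, 70.0, 75.0],
})

# Average math score per school and grade, with one column per grade.
avg_by_grade = (
    toy_df.groupby(["school_name", "grade"])["math_score"]
    .mean()
    .unstack("grade")
)

print(avg_by_grade)
```

A group whose scores are all NaN has nothing to average, so its mean is NaN, while every other school and grade is computed as usual.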
- Scores by school spending
These are the results categorized by budget per student, meaning the money that the school theoretically invests in each student. In this case, the NaNs barely affect the results, because a couple of other schools fall into the same per-student budget category, so the category's average scores and passing percentages do not change considerably. The per-student budget is calculated by dividing the school budget by its number of students. Thomas High School still has 1,635 students and a $1,043,130 budget, which places it in the third per-student budget interval at $638 per student.
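The per-student budget arithmetic can be checked with a short sketch. The column names `size` and `budget` are assumptions about the layout of the school data; the two figures are the ones quoted above.

```python
import pandas as pd

# One-row frame with the Thomas High School figures quoted in the text.
# Column names are assumed for illustration.
schools = pd.DataFrame({
    "school_name": ["Thomas High School"],
    "size": [1635],
    "budget": [1043130],
})

# Per-student budget = total school budget / number of students.
schools["per_student_budget"] = schools["budget"] / schools["size"]

print(schools)
```

Because this calculation uses only the enrollment and the budget, replacing test scores with NaNs cannot move a school into a different spending interval.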
- Scores by school size
Again, Thomas High School still has 1,635 students, which makes it a medium-sized school along with four others. Replacing the ninth graders' scores with NaNs leaves the data almost untouched: there are no significant changes in the results for this category.
- Scores by school type
Last but not least, the results by school type did not change at all. This is because Thomas High School is categorized as a Charter school along with seven other schools. Its ninth graders are just a small group relative to all the students in Charter schools, so the overall results stayed untouched.
In conclusion, replacing the ninth graders' results at Thomas High School with NaNs generated four overall changes to the updated School District Analysis. First, there was a slight change in the district summary: as these students made up only a small portion of the entire student population across 15 schools, the averages and passing percentages decreased by no more than 0.3. Second, there was a big change in the per-school summary, where each school is analyzed independently. For Thomas H.S., the replaced 9th graders' scores accounted for almost 30% of the student population. The school's passing percentages increased considerably, because the 9th graders had been bringing the averages and passing percentages down, as most of them had scores below 70. The passing percentages moved from the 60s to the 90s with the NaNs. Third, the top schools graph was also affected, as Thomas H.S. was now in second place with an overall passing percentage of 90.6%. Lastly, the graph of average math and reading scores per grade was affected: the mean could not be calculated for the 9th graders at Thomas H.S. because all of their scores had been replaced with NaNs, so the graph displays NaN for them.