Homework Code For My Data Analysis Course, instructed by Jon Walker
Modify the Law of Large Numbers rmd file to:
- Use a uniform distribution instead of a normal distribution. The function for a uniform is runif instead of rnorm.
- Change the sample sizes from 10, 100, and 1000 to 20, 300, and 4000.
- Change the output so that it prints the actual sample mean instead of the difference between the sample mean and the population mean.
- For the biased section, change the rule so that an observation between .6 and .7 is set to a missing value (NA).
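The changes above can be sketched as follows. The original rmd file is not shown here, so the structure, the seed, and the names `sizes`, `unbiased_means`, and `biased_mean` are assumptions made for illustration.

```r
# Sketch of the modified Law of Large Numbers exercise (assumed structure).
set.seed(1)                      # arbitrary seed so the run is repeatable
sizes <- c(20, 300, 4000)        # sample sizes from the assignment

# Unbiased case: report the actual sample mean of Uniform(0, 1) draws.
unbiased_means <- sapply(sizes, function(n) mean(runif(n)))
names(unbiased_means) <- sizes
print(unbiased_means)

# Biased case: observations between 0.6 and 0.7 become NA before averaging.
biased_mean <- function(n) {
  x <- runif(n)
  x[x > 0.6 & x < 0.7] <- NA     # the biasing rule from the assignment
  mean(x, na.rm = TRUE)          # mean of the observations that remain
}
biased_means <- sapply(sizes, biased_mean)
names(biased_means) <- sizes
print(biased_means)
```

With the default Uniform(0, 1), the unbiased means should approach 0.5 as the sample grows, while the biased means should approach about 0.483, since dropping the 0.6 to 0.7 band removes above-average values.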
- Download the iris data set from courseweb.
- Delete the setosa iris cases.
- Keep petal width, sepal width, and iris type.
- Make a dummy variable for iris type.
- Create a random number and sort the data by that random number.
- Divide the data (100 obs) into 10 groups based on their order.
- Take 9 of the 10 groups and run a linear regression with petal width as a function of iris type and sepal width.
- Do that 10 times so that each group of 10 obs is omitted from one of the regressions.
- Keep track of the intercept, dummy variable coefficient, and sepal width coefficient.
- Make a histogram (hist) of those three values (there will be 3 histograms with 10 observations in each histogram).
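One possible way to carry out the steps above is sketched below. It assumes R's built-in `iris` data frame in place of the courseweb download, and the names `dat`, `virginica`, `group`, and `coefs` are illustrative only.

```r
set.seed(2)                              # arbitrary seed for a repeatable shuffle

# Assumption: the built-in iris data stands in for the courseweb file.
dat <- subset(iris, Species != "setosa",
              select = c(Petal.Width, Sepal.Width, Species))
dat$virginica <- as.integer(dat$Species == "virginica")  # dummy for iris type

dat$rand <- runif(nrow(dat))             # random number used as the sort key
dat <- dat[order(dat$rand), ]
dat$group <- rep(1:10, each = 10)        # 10 groups of 10 by sorted order

# Fit the regression 10 times, leaving out one group each time.
coefs <- t(sapply(1:10, function(g) {
  fit <- lm(Petal.Width ~ virginica + Sepal.Width,
            data = dat[dat$group != g, ])
  coef(fit)
}))
colnames(coefs) <- c("intercept", "virginica", "sepal_width")

# One histogram per tracked coefficient, ten values in each.
for (nm in colnames(coefs)) {
  hist(coefs[, nm], main = nm, xlab = "estimate")
}
```

Each row of `coefs` holds the estimates from one regression, so the spread in each histogram gives a rough sense of how stable that coefficient is across the ten fits.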
Using the iris dataset:
- Using only sepal length as a variable, create a 10-fold knn execution with k=5.
- From the testing set, keep track of what each observation was predicted to be.
- Create a confusion matrix of the final results: what each of the 150 observations was predicted to be vs. what it actually is.
- Create a density plot of correct and incorrect predictions as a function of sepal length.
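The steps above might look like the sketch below. It assumes the `class` package (a recommended package that normally ships with R) for `knn()`, and the names `fold`, `pred`, and `confusion` are illustrative.

```r
library(class)                               # provides knn()

set.seed(3)                                  # arbitrary seed for repeatable folds
n <- nrow(iris)
fold <- sample(rep(1:10, length.out = n))    # random fold label per observation
pred <- factor(rep(NA, n), levels = levels(iris$Species))

# For each fold, train on the other nine and predict the held-out one.
for (f in 1:10) {
  test <- fold == f
  pred[test] <- knn(train = iris[!test, "Sepal.Length", drop = FALSE],
                    test  = iris[test, "Sepal.Length", drop = FALSE],
                    cl    = iris$Species[!test], k = 5)
}

# Predicted vs. actual type for all 150 observations.
confusion <- table(predicted = pred, actual = iris$Species)
print(confusion)

# Density of sepal length for correct and incorrect predictions.
correct <- pred == iris$Species
plot(density(iris$Sepal.Length[correct]),
     main = "Sepal length by prediction outcome", xlab = "Sepal length")
lines(density(iris$Sepal.Length[!correct]), lty = 2)
legend("topright", c("correct", "incorrect"), lty = 1:2)
```

Because every observation lands in exactly one testing fold, `pred` ends up with one prediction per observation and the confusion matrix sums to 150.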
In homework 3, we did a 10-fold cross-validation of the iris dataset. In homework 4, we will:
- Read in the iris data set.
- Randomly pick 50 observations to be the testing dataset.
- Do this loop 100 times: from the remaining 100 observations, take a sample of 100 (with replacement) and build a model predicting the iris type of an observation given its sepal width.
- Using the model just developed, predict the iris type of each of the 50 observations held out as the testing dataset.
- For each observation in the testing dataset, keep track of how often it is predicted to be each of the three iris types.
- After you have 100 predictions for each of the 50 observations in the testing dataset, decide the iris type of each observation by voting: each observation is predicted to be whichever type gets the most votes. If there is a tie, randomly choose among the tied types.
- Create a confusion matrix for these 50 observations.
- Compare the accuracy rate from this process with what you observed for homework 3.
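A minimal sketch of the bootstrap-and-vote loop above follows. The assignment does not say which model to build, so using `knn()` with k=5 from the `class` package is an assumption, as are the names `votes`, `winner`, and `voted`.

```r
library(class)                               # provides knn(); model choice is an assumption

set.seed(4)                                  # arbitrary seed for a repeatable split
test_idx <- sample(nrow(iris), 50)           # 50 held-out testing observations
test  <- iris[test_idx, ]
train <- iris[-test_idx, ]
types <- levels(iris$Species)

# votes[i, j] counts how often testing observation i was predicted as type j.
votes <- matrix(0, nrow = 50, ncol = 3, dimnames = list(NULL, types))

for (b in 1:100) {
  boot <- train[sample(nrow(train), replace = TRUE), ]   # bootstrap sample of 100
  p <- knn(boot[, "Sepal.Width", drop = FALSE],
           test[, "Sepal.Width", drop = FALSE],
           cl = boot$Species, k = 5)
  idx <- cbind(1:50, as.integer(p))          # (row, predicted type) pairs
  votes[idx] <- votes[idx] + 1
}

# Majority vote per observation; ties are broken at random.
winner <- apply(votes, 1, function(v) {
  top <- which(v == max(v))
  if (length(top) > 1) top <- sample(top, 1)
  top
})
voted <- factor(types[winner], levels = types)

confusion <- table(predicted = voted, actual = test$Species)
print(confusion)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

The `accuracy` value is the figure to set against the homework 3 result, keeping in mind that this sketch predicts from sepal width while homework 3 used sepal length.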
Using iris data:
- Create 'Cost Function': Benefits: Setosa correct 1 point, Versicolor correct 3 points, Virginica correct 1 point. Costs: Versicolor wrong 10 points.
- Determine Best: 1-, 2-, and 3-variable models using Naive Bayes and some other method.
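The cost function above could be scored as in the sketch below. It assumes that "Versicolor wrong" means a true versicolor predicted as another type, which the assignment does not spell out. The function name `score_predictions` and the example vectors are hypothetical.

```r
# Score predictions under the assignment's benefits and costs.
# Assumption: the 10-point cost applies when a true versicolor is misclassified.
score_predictions <- function(actual, predicted) {
  actual    <- as.character(actual)
  predicted <- as.character(predicted)
  benefit <- c(setosa = 1, versicolor = 3, virginica = 1)
  hit  <- actual == predicted
  gain <- sum(benefit[actual[hit]])                 # points for correct predictions
  loss <- 10 * sum(actual == "versicolor" & !hit)   # penalty for missed versicolor
  gain - loss
}

# Small made-up example: two hits, one missed versicolor, one missed virginica.
actual    <- c("setosa", "versicolor", "versicolor", "virginica")
predicted <- c("setosa", "versicolor", "virginica",  "versicolor")
print(score_predictions(actual, predicted))
```

In this example the two correct predictions earn 1 + 3 points and the missed versicolor costs 10, for a total of -6. The missed virginica earns and costs nothing. The same function can then be used to compare candidate models by total score instead of plain accuracy.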
- Create the Reuters dataset.
- Create a smaller dataset with inspect(removeSparseTerms("your dataset name",0.6)). Be sure to do all the preprocessing: stemming, stop words, etc.
- Do a hierarchical analysis with however many documents you end up with and plot the tree out.
- Go ahead and use a Euclidean distance metric instead of cosine similarity if you want.
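The hierarchical analysis step can be sketched with base R alone. The small term-count matrix below is made up for illustration and stands in for the reduced Reuters document-term matrix. The names `dtm`, `tree`, and `groups` are assumptions.

```r
# Hypothetical term counts: five documents (rows) by four terms (columns).
# In the assignment this would be the matrix left after removeSparseTerms.
dtm <- matrix(c(3, 0, 1, 0,
                2, 0, 2, 0,
                0, 4, 0, 1,
                0, 3, 0, 2,
                1, 1, 1, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("doc", 1:5),
                              c("oil", "price", "trade", "bank")))

d <- dist(dtm, method = "euclidean")      # Euclidean distance between documents
tree <- hclust(d, method = "complete")    # complete linkage is hclust's default
plot(tree, main = "Hierarchical clustering of documents", xlab = "", sub = "")

groups <- cutree(tree, k = 2)             # cut the tree into two clusters
print(groups)
```

Here doc1, doc2, and doc5 fall into one cluster and doc3 and doc4 into the other. With the real data, the rows of the document-term matrix would take the place of `dtm`.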