All the exercises are independent and may be done in any order. You may use R whenever it helps. You may use a result stated in a previous question even if you have not been able to prove it. You are not required to prove the hints. Please include all the commands you use (including the code given during the lectures) in your script. At the end of the exam, send your script by email to laurent.delsol@univ-orleans.fr.
The diameters of the gold nuggets in a gold-bearing river are known to follow an \(\mathcal{E}(1)\) distribution. A gold prospector uses a sieve whose mesh diameter \(\theta\) is unknown. The size \(X\) of a harvested nugget then has the density \[f_\theta (x) = e^{-(x-\theta)}1_{\{x\geq\theta\}}.\] The variable size in the dataset Nuggets.RData contains the sizes of the harvested nuggets (observed values of \(X\)).
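# moda: sorted distinct observed values, Fc: cumulative relative frequencies (assumed to be defined as in the lecture code)
tau=1
u=seq(min(moda)-tau,max(moda)+tau,len=10000)
# empirical cumulative distribution function (step plot)
plot(c(min(moda)-tau,moda,max(moda)+tau),c(0,Fc,1),type='s')
# cumulative distribution function of E(1), in red
lines(u,pexp(u,1),col='red')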
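The commands above rely on two objects, `moda` and `Fc`, that are not defined in this sheet. Here is a minimal sketch of one way to build them, assuming `moda` holds the sorted distinct observed values and `Fc` the cumulative relative frequencies at those values. The simulated vector `size_sim` is a hypothetical stand-in for the variable size of Nuggets.RData.

```r
# Hypothetical stand-in for the observed sizes (the real data are in Nuggets.RData)
set.seed(1)
size_sim <- 2 + rexp(100, rate = 1)

# Assumed definitions: sorted distinct values and cumulative relative frequencies
moda <- sort(unique(size_sim))
Fc <- as.numeric(cumsum(table(size_sim)) / length(size_sim))

# Same plotting commands as above
tau <- 1
u <- seq(min(moda) - tau, max(moda) + tau, len = 10000)
plot(c(min(moda) - tau, moda, max(moda) + tau), c(0, Fc, 1), type = 's',
     xlab = 'size', ylab = 'empirical cdf')
lines(u, pexp(u, 1), col = 'red')
```

With the real data, `size_sim` would be replaced by the variable size after loading Nuggets.RData.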
Show that \(\mathbb{E}[X] = \theta + 1\). Hint: you can use that \(X-\theta\) has an exponential distribution with parameter 1.
Deduce by the Method of Moments an estimator \(\hat \theta_1\) of the parameter \(\theta\).
Compute the bias and variance of \(\hat \theta_1\).
Give a sufficient statistic for \(\theta\).
Show that the maximum likelihood estimator is \(\hat \theta_2= \min_{1\leq i\leq n} X_i\).
Prove that \(\hat \theta_2\) has the density \[ g_{\theta,n}(x) = ne^{-n(x-\theta)}1_{\{x\geq\theta\}}.\]
Calculate the bias and variance of \(\hat \theta_2\). Hint: the result stated in the previous question means that \(\hat \theta_2-\theta\) has an exponential distribution with parameter \(n\).
Construct an unbiased estimator \(\hat \theta_3\) from \(\hat \theta_2\). Calculate the variance of \(\hat \theta_3\).
Compare the mean squared errors of \(\hat \theta_1\), \(\hat \theta_2\) and \(\hat \theta_3\). Which estimator is the best?
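A theoretical comparison can be sanity-checked by simulation. The sketch below is only an illustration under assumptions: it takes the three estimators to be \(\overline X - 1\), \(\min_i X_i\) and \(\min_i X_i - \frac{1}{n}\) (the forms the previous questions are expected to lead to), and the values of `theta`, `n` and `B` are arbitrary.

```r
set.seed(123)
theta <- 2; n <- 50; B <- 5000   # arbitrary illustrative values

# Each row is one simulated sample of size n from the density f_theta
samples <- matrix(theta + rexp(B * n, rate = 1), nrow = B)

# Assumed forms of the three estimators
theta1 <- rowMeans(samples) - 1       # method of moments
theta2 <- apply(samples, 1, min)      # maximum likelihood
theta3 <- theta2 - 1 / n              # bias-corrected minimum

# Monte Carlo approximation of the mean squared errors
mse <- function(est) mean((est - theta)^2)
mse_hat <- c(theta1 = mse(theta1), theta2 = mse(theta2), theta3 = mse(theta3))
print(mse_hat)
```

The simulated values should only be used to check the orders of magnitude found by hand; they do not replace the calculation.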
Finally, compare the empirical cumulative distribution function with the cumulative distribution functions of \(\mathcal{E}(\hat\theta_1)\), \(\mathcal{E}(\hat\theta_2)\) and \(\mathcal{E}(\hat\theta_3)\), where \(\mathcal{E}(\theta)\) denotes the distribution with density \(f_\theta\). Comment. Hint: use the command
lines(u,pexp(u-theta,1),col='red')
to add the cumulative distribution function of \(\mathcal{E}(\theta)\).
A new escape game has just opened in HCMC. The exit door only opens
if the team has solved a series of random puzzles. At each try, one of
the two paths (simple or complex) is chosen at random (with
probabilities 1/4 and 3/4 respectively). The simple path has a success
rate of \(\theta\) while the complex
path has a success rate of \(\frac{\theta}{3}\). The probability of
opening the door at each try is therefore \(p=\frac{1}{4}\theta+\frac{3}{4}\frac{\theta}{3}=\frac{\theta}{2}\).
At each failure, the game starts again from scratch and a new path is
chosen randomly.
Consequently, the number \(X\) of unsuccessful attempts before opening the door follows a \(\mathcal{G}(\frac{\theta}{2})\) distribution given by: \[\forall k\in \mathbb{N},\ \mathbb{P}(X=k)=\frac{\theta}{2}\left(1-\frac{\theta}{2}\right)^k.\] The aim is to estimate \(\theta\in(0,1)\).
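The mechanism described above can be simulated directly to check that the number of failures behaves like a \(\mathcal{G}(\frac{\theta}{2})\) variable. This is a minimal sketch; the function `play_game` and the values of `theta` and `n_games` are hypothetical choices made for the illustration.

```r
set.seed(42)
theta <- 0.6      # arbitrary value in (0, 1)
n_games <- 20000  # number of simulated games

# Simulate one game: count the failed tries before the door opens
play_game <- function(theta) {
  failures <- 0
  repeat {
    simple <- runif(1) < 1 / 4                     # simple path with probability 1/4
    p_success <- if (simple) theta else theta / 3  # success rate of the chosen path
    if (runif(1) < p_success) return(failures)
    failures <- failures + 1
  }
}
x_sim <- replicate(n_games, play_game(theta))

# Compare the empirical frequencies with P(X = k) = (theta/2) (1 - theta/2)^k
k <- 0:5
empirical <- sapply(k, function(j) mean(x_sim == j))
theoretical <- dgeom(k, prob = theta / 2)
print(round(rbind(empirical, theoretical), 4))
```

R's `dgeom` counts the failures before the first success, which matches the parametrisation used here.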
Load the Escape_Game.RData working environment, which contains the observed values of \(X\) in the variable unsuccessful_attempts.
Specify the population, the type of variable and the sample size \(n\).
Represent the empirical distribution through a relevant graphic. Give the value of the mode.
Define \(Y=1_{\{X=0\}}\) and, for any \(1\leq i\leq n\), the random variable \(Y_i=1_{\{X_i= 0\}}\). Prove that \(\mathbb{E}[Y]=\mathbb{P}(X= 0)=\frac{\theta}{2}\) and deduce, by the method of moments, an unbiased estimator \(\hat \theta_1\) of \(\theta\) based on \(Y_1, \dots, Y_n\). Compute its bias and variance.
Prove \(\mathbb{E}[X]=\frac{2-\theta}{\theta}\). Hint: \(\sum_{k=0}^\infty k(1-\frac{\theta}{2})^{k-1}=\frac{4}{\theta^2}\).
Compute the Fisher information of the sample. Is \(\hat \theta_1\) an efficient estimator of \(\theta\)?
Prove that the distribution of \(X\) belongs to the exponential family and explain why \(\overline X\) is a sufficient and complete statistic.
Deduce from question 4, by the method of moments, a new estimator \(\hat \theta_2\) of \(\theta\).
Unfortunately, the bias and variance of \(\hat \theta_2\) are very difficult to compute. Consider instead \(\hat \theta_3=2\frac{1-\frac{1}{n}}{1-\frac{1}{n}+\overline X}=2\frac{n-1}{n-1+n\overline X}\). Prove \(\hat \theta_3\) is an unbiased estimator of \(\theta\). Hint: \(\forall k\in\mathbb{N},\, \mathbb{P}(X_1+\dots+X_n =k)=\frac{(n-1+k)!}{k!(n-1)!}\left(1-\frac{\theta}{2}\right)^k\left(\frac{\theta}{2}\right)^n,\) and \(\mathbb{E}[\hat \theta_3]=\sum_{k=0}^\infty 2 \frac{n-1}{n-1+k}\mathbb{P}(X_1+\dots+X_n =k)\).
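The series given in the hint can be evaluated numerically before attempting the proof. The sketch below uses `dnbinom`, whose parametrisation (number of failures before the `size`-th success) matches the distribution of \(X_1+\dots+X_n\) stated in the hint. The values of `theta` and `n` and the truncation point of the sum are arbitrary assumptions.

```r
theta <- 0.6; n <- 10      # arbitrary illustrative values
k <- 0:5000                # truncation of the infinite sum (assumed large enough)

# P(X_1 + ... + X_n = k) for the sum of n independent G(theta/2) variables
p_k <- dnbinom(k, size = n, prob = theta / 2)

# Truncated version of the series giving the expectation of theta_3
expectation <- sum(2 * (n - 1) / (n - 1 + k) * p_k)
print(c(expectation = expectation, theta = theta))
```

A numerical agreement for a few values of `theta` and `n` is only a check, not a proof.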
Explain why there is no need to compute the variance of \(\hat \theta_3\) to prove \(\hat \theta_3\) is the best unbiased estimator of \(\theta\). Hint: use questions 7 and 9.
Compute the values of the three estimators \(\hat \theta_1\), \(\hat \theta_2\), and \(\hat \theta_3\).
This exercise is devoted to the study of the Marathon dataset. For this, you have a Marathon.RData file containing the data.
What is the nature of the data? What is the sample size? Discuss in a few lines the interest of a non-parametric approach at this stage of the study.
Give the expression of the kernel estimator of the density. Show that if the kernel \(K\) is a density, then so is the estimator.
Estimate the density of the variable Time using an Epanechnikov kernel with smoothing parameters 0.5, 5 and 50, and then with the one chosen by cross-validation. Comment on the results obtained.
Give an estimate of the 3 main modes. Can you make a guess about the affiliations of the 3 groups of athletes thus revealed?
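The property asked for above can be checked numerically on an example. The sketch below hand-codes a kernel density estimator of the usual form \(\hat f_h(x)=\frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-X_i}{h}\right)\) with an Epanechnikov kernel and approximates its integral. The functions `K_epan` and `kde`, the simulated sample `obs` and the bandwidth `h` are hypothetical choices, not the lecture code.

```r
# Epanechnikov kernel: a density supported on [-1, 1]
K_epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Kernel density estimator at the points x, from the sample obs, with bandwidth h
kde <- function(x, obs, h) {
  sapply(x, function(x0) mean(K_epan((x0 - obs) / h)) / h)
}

set.seed(7)
# Hypothetical stand-in for the variable Time
obs <- c(rnorm(100, 130, 5), rnorm(100, 170, 5), rnorm(100, 215, 5))
h <- 5
grid <- seq(min(obs) - h, max(obs) + h, length.out = 4001)
f_hat <- kde(grid, obs, h)

# Riemann-sum approximation of the integral of the estimator
integral <- sum(f_hat) * (grid[2] - grid[1])
print(integral)
plot(grid, f_hat, type = 'l', xlab = 'Time', ylab = 'estimated density')
```

R's built-in `density()` also accepts `kernel = "epanechnikov"`, but its `bw` argument is the standard deviation of the scaled kernel, so its values are not directly comparable with `h` in this sketch.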
Give an estimate of the 2 minima m1 and m2 between the modes obtained in question (4). Use them to define 3 clusters and compare them to the affiliation (variable Type). Hint: you can use the following commands
# cluster 1: Time < m1, cluster 2: m1 <= Time < m2, cluster 3: Time >= m2
cluster=rep(1,length(Time))
cluster[(Time>=m1) & (Time<m2)]=2
cluster[(Time>=m2)]=3
# cross-tabulate the clusters with the affiliation
table(cluster,Type)
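One possible way to locate modes and minima numerically is to look for sign changes in the slope of an estimated density evaluated on a grid. The sketch below is an illustration on simulated data. The vector `time_sim` stands in for the variable Time, the Gaussian kernel and the bandwidth `bw = 6` are arbitrary choices made for the sketch (not the Epanechnikov or cross-validated settings asked for above), and `min_between` is a hypothetical helper.

```r
set.seed(11)
# Hypothetical stand-in for the variable Time (three groups of runners)
time_sim <- c(rnorm(200, 130, 5), rnorm(200, 170, 5), rnorm(200, 215, 5))

# Gaussian-kernel estimate on a grid (bandwidth chosen by hand for this sketch)
d <- density(time_sim, bw = 6, n = 2048)

# Local maxima: grid points where the slope changes from positive to negative
is_max <- which(diff(sign(diff(d$y))) < 0) + 1
top3 <- is_max[order(d$y[is_max], decreasing = TRUE)][1:3]
modes <- sort(d$x[top3])

# Minimum between two consecutive modes: lowest point of the estimate in between
min_between <- function(a, b) {
  inside <- d$x > a & d$x < b
  d$x[inside][which.min(d$y[inside])]
}
m1 <- min_between(modes[1], modes[2])
m2 <- min_between(modes[2], modes[3])
print(c(modes = modes, m1 = m1, m2 = m2))

# Clusters defined by the two cut points
cluster <- rep(1, length(time_sim))
cluster[(time_sim >= m1) & (time_sim < m2)] <- 2
cluster[time_sim >= m2] <- 3
print(table(cluster))
```

The estimated modes and minima depend on the bandwidth, so it is worth checking how stable they are when the smoothing parameter changes.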
Give the definition of the Bayes classifier. Can we use it directly here? Why? How can you obtain a classification method from kernel estimators?
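As an illustration of the last point, a classification rule can be sketched by plugging kernel estimates of the class densities and empirical class proportions into the Bayes rule. This is not the function Classif_NP seen in class. The helpers `kde_at` and `classify_kde`, the Gaussian kernel, the bandwidth `h = 3` and the simulated training data are all assumptions made for the sketch.

```r
# Kernel density estimate at the points x (Gaussian kernel chosen for this sketch)
kde_at <- function(x, obs, h) sapply(x, function(x0) mean(dnorm((x0 - obs) / h)) / h)

# Plug-in rule: assign each x to the class maximising
# (estimated class proportion) * (estimated class density at x)
classify_kde <- function(x_new, x_train, y_train, h) {
  classes <- sort(unique(y_train))
  scores <- sapply(classes, function(cl) {
    in_class <- y_train == cl
    mean(in_class) * kde_at(x_new, x_train[in_class], h)
  })
  scores <- matrix(scores, nrow = length(x_new))  # one row per new observation
  classes[apply(scores, 1, which.max)]
}

set.seed(3)
# Hypothetical training data: times and affiliations for three groups
x_train <- c(rnorm(150, 130, 5), rnorm(150, 170, 5), rnorm(150, 215, 5))
y_train <- rep(c("A", "B", "C"), each = 150)

pred <- classify_kde(x_train, x_train, y_train, h = 3)
print(table(predicted = pred, actual = y_train))  # confusion matrix
print(mean(pred != y_train))                      # resubstitution error rate (optimistic)
new_pred <- classify_kde(c(131.7, 171.9, 217.2), x_train, y_train, h = 3)
print(new_pred)
```

The error rate computed on the training data itself is optimistic; a held-out sample or cross-validation gives a fairer estimate.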
Use the variables Type and Time and the function Classif_NP seen in class to perform a supervised classification of the runners’ times (according to their affiliation). Comment on the results obtained (confusion matrix, error rate).
Here are the times obtained by 7 new competitors: 131.682, 171.873, 153.045, 188.223, 170.289, 154.824, 217.248. Estimate their probable affiliation. Are you confident about these estimated affiliations?