(1) Describe an interesting applied statistics problem that you have worked on.
The project I am currently working on, the Empire State Poll, is an interesting applied statistics project.
The Empire State Poll (ESP) is a first-of-its-kind annual general survey of adults, age 18 and over, who are residents of New York State. It is conducted by Cornell University's Survey Research Institute in the spring of each year. The first ESP was conducted in 2003.
The objective is to identify and characterize the changing attitudes and concerns of New York State residents over the past 13 years.
I expect to explore the data further based on demographic variables, e.g. downstate/upstate, gender, race, age, household income, …
B. What practical concerns were there for dealing with the data: did you know how it was collected?, were there dirty data problems?, was there sampling bias?, etc.
All ESP surveys are conducted using a Computer Assisted Telephone Interviewing (CATI) software system.
The survey sample consists of randomly selected households within New York State, split between Upstate and Downstate residents. The sample selection procedures ensure that every household within New York State has an equal chance of being included in the survey. Because the results come from a sample, they are subject to sampling error: the overall ESP results are not expected to differ by more than about 3.5 percentage points from the answers that would be obtained if all New York State residents were interviewed.
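To make the sampling-error figure concrete, here is a minimal sketch of the usual margin-of-error formula for a proportion. It assumes simple random sampling, a 95% confidence level (z = 1.96), and the most conservative proportion p = 0.5; none of these is stated in the ESP description above, and the function names are my own.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    n: number of completed interviews (hypothetical input)
    p: assumed population proportion; 0.5 gives the widest interval
    z: critical value; 1.96 corresponds to an assumed 95% confidence level
    """
    return z * math.sqrt(p * (1 - p) / n)

def sample_size_for_margin(moe, p=0.5, z=1.96):
    """Smallest sample size whose margin of error does not exceed `moe`."""
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

if __name__ == "__main__":
    # Sample size implied by a 3.5-point margin under the assumptions above.
    print(sample_size_for_margin(0.035))
    # Margin of error for a few illustrative sample sizes.
    for n in (400, 800, 1600):
        print(n, round(100 * margin_of_error(n), 2), "percentage points")
```

Under these assumptions, a margin of 3.5 percentage points corresponds to a little under 800 completed interviews. Estimates for subgroups (for example, Upstate residents only) rest on fewer interviews and therefore have wider margins.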
One issue with the data is that some questions (variables in the dataset) are not quantitative; another is that a number of responses are missing.
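The two data issues above (non-quantitative answers and missing responses) can be illustrated with a small sketch. The column names, response labels, and missing-value codes below are hypothetical; the actual ESP codebook is not reproduced here.

```python
# Hypothetical codes treated as missing; a real codebook would define these.
MISSING_CODES = {"", "NA", "Refused", "Don't know"}

def is_missing(value):
    """True if a response is absent or uses one of the assumed missing codes."""
    return value is None or str(value).strip() in MISSING_CODES

def missing_rate(rows, column):
    """Share of respondents with no usable answer in `column`."""
    return sum(is_missing(row.get(column)) for row in rows) / len(rows)

def one_hot(rows, column):
    """Code a categorical answer as 0/1 indicators, one per observed level.

    Missing answers are kept as None rather than silently coded as 0.
    """
    levels = sorted({row[column] for row in rows if not is_missing(row.get(column))})
    encoded = []
    for row in rows:
        value = row.get(column)
        if is_missing(value):
            encoded.append({f"{column}={lvl}": None for lvl in levels})
        else:
            encoded.append({f"{column}={lvl}": int(value == lvl) for lvl in levels})
    return encoded

# Made-up respondents, for illustration only.
rows = [
    {"region": "Upstate", "income": "50-75k"},
    {"region": "Downstate", "income": "Refused"},
    {"region": "Upstate", "income": None},
    {"region": "Downstate", "income": "under 50k"},
]

if __name__ == "__main__":
    print(missing_rate(rows, "income"))
    print(one_hot(rows, "region"))
```

Keeping missing answers as `None` makes the later choice explicit: drop those respondents, impute, or treat "missing" as its own category.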
C. How did you choose the methodology for the problem? (i.e. was it industry standard?, was it tailored for this data?, etc.)
Since some of the data is not quantitative, I have to code it as categorical variables. I can then run regression analyses to see the impact of household income, or fit a logistic regression to check whether families were better off during President Obama's term compared with President Bush's …
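As a rough illustration of the logistic regression step, the sketch below fits a binary "better off" outcome on two 0/1 indicators using plain batch gradient descent. This is a stand-in for a library routine, not the procedure used in the project; the variable names and the data are invented for the example.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, x):
    """P(outcome = 1 | x); weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * xj for w, xj in zip(weights[1:], x)))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    n, k = len(X), len(X[0])
    weights = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(weights, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

# Invented data: each row is [upstate, high_income]; y = 1 means "better off".
X = [[0, 0], [0, 0], [0, 0], [0, 0],
     [0, 1], [0, 1], [0, 1],
     [1, 0], [1, 0], [1, 0],
     [1, 1], [1, 1], [1, 1], [1, 1]]
y = [0, 0, 0, 1,
     1, 1, 0,
     0, 1, 0,
     1, 1, 1, 0]

weights = fit_logistic(X, y)

if __name__ == "__main__":
    print("intercept, upstate, high_income:", [round(w, 3) for w in weights])
    print("P(better off | downstate, high income):", round(predict_proba(weights, [0, 1]), 3))
```

The sign and size of each coefficient indicate how the log-odds of answering "better off" shift with that indicator. With real survey data, a standard routine that also reports standard errors would be the more natural choice.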
I am familiar with using SAS, R, Python, Excel (VBA, PivotTable), and Tableau to analyze and visualize data.
I hold the SAS Certified Advanced Programmer for SAS 9 credential.
(4) What would you consider to be your quantitative expertise and interest? (i.e. data mining, machine learning, time series analysis, experiment design, operations research etc.)
I am extremely interested in data mining and machine learning. I have completed one project that applied supervised and unsupervised learning methods, such as best subset selection, random forests, lasso and ridge regression, and dimension reduction, to select models and improve predictive accuracy. I am also taking the "Machine Learning for Data Science" class this semester at Cornell and will have two projects posted on Kaggle by the end of the semester.
(5) Please list your top 5 technical skills (programming languages, etc.) and rate each one as basic, intermediate or advanced.
R - Advanced
SAS - Advanced
SQL - Advanced
Python - Intermediate
Hadoop - Basic
(6) Which of Google's products do you find most interesting (please be brief)?
Gmail - spam filtering