For R Beginners, Lesson 3: "Basic Statistical Analysis"
3. Basic Statistical Analysis in R
In this section, we will cover some basic statistical analyses that you can perform using R. These analyses include descriptive statistics, hypothesis testing, correlation analysis, and regression analysis. Each of these analyses helps us understand our data better and make informed decisions based on the data.
3.1 Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. Common descriptive statistics include measures of central tendency (mean, median, mode) and measures of spread (range, variance, standard deviation). Note that base R has no built-in function for the statistical mode; the mode() function in R returns the storage type of an object, not the most frequent value.
Example: Calculating Descriptive Statistics
# Sample data: Heights of a group of people
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
# Mean (average) of heights
mean_height <- mean(heights)
print(mean_height) # Output: [1] 169.2
# Median of heights
median_height <- median(heights)
print(median_height) # Output: [1] 169.5
# Standard deviation of heights
sd_height <- sd(heights)
print(sd_height) # Output: [1] 7.315129
# Variance of heights
var_height <- var(heights)
print(var_height) # Output: [1] 53.51111
# Minimum and maximum heights
min_height <- min(heights)
print(min_height) # Output: [1] 158
max_height <- max(heights)
print(max_height) # Output: [1] 181
In this example, we calculate various descriptive statistics for a vector of heights. These statistics give us a summary of the data, such as the average height, the middle value (median), the amount of variability in heights (standard deviation and variance), and the range of heights.
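Instead of calling each function separately, base R also provides summary(), which reports several of these statistics in one call. A quick sketch using the same heights vector:

```r
# Heights from the example above
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)

# summary() reports the minimum, quartiles, median, mean, and maximum at once
summary(heights)
# Output:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   158.0   164.2   169.5   169.2   173.5   181.0
```

This is a convenient first look at any numeric vector before computing individual statistics.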
3.2 Handling Missing Data
When working with datasets, you often encounter missing values (NA). It's important to handle these missing values correctly to avoid inaccuracies in your analysis.
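One reason missing values matter is that most basic statistical functions return NA if the input contains any NA at all; the na.rm = TRUE argument tells them to ignore missing values. A small sketch:

```r
# A vector with one missing value
ages <- c(25, NA, 30, 22)

# By default, the NA propagates through the calculation
mean(ages)                 # Output: [1] NA

# na.rm = TRUE drops the NA before computing
mean(ages, na.rm = TRUE)   # Output: [1] 25.66667
```

The same na.rm argument works for sum(), sd(), var(), min(), and max().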
Removing Rows with Missing Data
To remove rows that contain NA values, use the na.omit() function, or use logical indexing with is.na() when you only care about specific columns.
Example Data Frame with Missing Data
# Creating a data frame with some missing values
my_data <- data.frame(
Name = c("Alice", "Bob", NA, "David"),
Age = c(25, NA, 30, 22),
Salary = c(50000, 45000, 60000, NA)
)
print(my_data)
# Output:
# Name Age Salary
# 1 Alice 25 50000
# 2 Bob NA 45000
# 3 <NA> 30 60000
# 4 David 22 NA
Removing Rows with NA Values
To remove all rows containing any NA values, use the na.omit() function:
# Removing rows with any NA values
cleaned_data <- na.omit(my_data)
print(cleaned_data)
# Output:
# Name Age Salary
# 1 Alice 25 50000
Removing Rows Based on Specific Conditions
You can also remove rows where specific columns contain NA values using logical indexing.
The ! operator in R means "not" and is used to reverse a logical condition. For example, !is.na(my_data$Age) selects rows where the Age column is not NA.
# Removing rows where the 'Age' column has NA values
cleaned_data <- my_data[!is.na(my_data$Age), ]
print(cleaned_data)
# Output:
# Name Age Salary
# 1 Alice 25 50000
# 3 <NA> 30 60000
# 4 David 22 NA
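An alternative to na.omit() is the base R function complete.cases(), which returns a logical vector marking the rows that have no missing values in any column; combined with indexing, it gives the same result as na.omit(). A sketch using the same my_data frame:

```r
# The same data frame with missing values as above
my_data <- data.frame(
  Name = c("Alice", "Bob", NA, "David"),
  Age = c(25, NA, 30, 22),
  Salary = c(50000, 45000, 60000, NA)
)

# TRUE only for rows with no NA in any column
complete.cases(my_data)  # Output: [1] TRUE FALSE FALSE FALSE

# Keep only the complete rows (same result as na.omit(my_data))
cleaned_data <- my_data[complete.cases(my_data), ]
print(cleaned_data)
# Output:
#    Name Age Salary
# 1 Alice  25  50000
```

complete.cases() is handy when you want the logical vector itself, for example to count or inspect the incomplete rows before deleting them.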
3.3 Combining Data Frames with rbind()
The rbind() function in R combines two or more data frames by rows. It is particularly useful when you have datasets with the same column names and want to stack them on top of each other; if the column names differ, rbind() raises an error.
Example: Combining Data Frames
# Creating two data frames
data1 <- data.frame(
Name = c("Alice", "Bob"),
Age = c(25, 30)
)
data2 <- data.frame(
Name = c("Charlie", "David"),
Age = c(35, 40)
)
# Combining data frames by rows
combined_data <- rbind(data1, data2)
print(combined_data)
# Output:
# Name Age
# 1 Alice 25
# 2 Bob 30
# 3 Charlie 35
# 4 David 40
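The column-wise counterpart of rbind() is cbind(), which attaches new columns to rows that are already aligned. A small sketch continuing the example, with a hypothetical Salary column:

```r
data1 <- data.frame(
  Name = c("Alice", "Bob"),
  Age = c(25, 30)
)

# A hypothetical extra column for the same two people, in the same order
salaries <- data.frame(Salary = c(50000, 45000))

# Combining data frames by columns
combined_cols <- cbind(data1, salaries)
print(combined_cols)
# Output:
#    Name Age Salary
# 1 Alice  25  50000
# 2   Bob  30  45000
```

With cbind(), it is your responsibility to make sure the rows of both data frames refer to the same observations in the same order.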
By mastering these basic data handling techniques, including removing missing data and combining datasets, you can ensure your data is clean and ready for analysis, which is crucial for accurate statistical computations.
3.4 Hypothesis Testing
Hypothesis testing is a method used to determine if there is enough evidence in a sample of data to infer that a certain condition is true for the entire population. A common hypothesis test is the t-test, which compares the means of two groups to see if they are significantly different.
Example: Performing a t-test
# Sample data: Weights of two different groups
group1 <- c(55, 60, 65, 58, 62)
group2 <- c(68, 70, 72, 75, 69)
# Perform a two-sample t-test
t_test_result <- t.test(group1, group2)
# Print the t-test result
print(t_test_result)
# Output:
#
# Welch Two Sample t-test
#
# data: group1 and group2
# t = -5.1255, df = 7.3138, p-value = 0.001189
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -15.739562 -5.860438
# sample estimates:
# mean of x mean of y
# 60.0 70.8
In this example, we use a two-sample t-test to compare the means of two groups (group1 and group2). The t.test() function calculates the t-statistic and the p-value. A small p-value (typically less than 0.05) indicates that there is strong evidence against the null hypothesis, suggesting that the means of the two groups are significantly different.
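The object returned by t.test() is a list, so individual components such as the p-value or confidence interval can be pulled out with $ rather than read off the printout. A sketch using the same two groups:

```r
group1 <- c(55, 60, 65, 58, 62)
group2 <- c(68, 70, 72, 75, 69)
t_test_result <- t.test(group1, group2)

# Extract individual components from the result
t_test_result$p.value    # matches "p-value = 0.001189" in the printout above
t_test_result$conf.int   # the 95 percent confidence interval
t_test_result$estimate   # the two sample means (60.0 and 70.8)
```

Extracting components this way is useful when the p-value feeds into further code, for example an automated check like p < 0.05.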
3.5 Correlation Analysis
Correlation analysis measures the strength and direction of the linear relationship between two variables. The most common measure of correlation is Pearson's correlation coefficient.
Example: Calculating Pearson Correlation
# Sample data: Heights and weights of a group of people
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
weights <- c(55, 65, 59, 73, 78, 63, 70, 58, 54, 64)
# Calculate Pearson correlation coefficient
correlation <- cor(heights, weights)
print(correlation) # Output: [1] 0.9755095
In this example, we calculate the Pearson correlation coefficient between two variables: heights and weights. The cor() function computes the correlation, and the result (approximately 0.98) indicates a strong positive linear relationship between height and weight.
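cor() returns only the coefficient. To test whether the correlation is statistically significant, base R provides cor.test(), which also reports a p-value and a confidence interval. A sketch with the same data:

```r
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
weights <- c(55, 65, 59, 73, 78, 63, 70, 58, 54, 64)

# Pearson correlation with a significance test
cor_result <- cor.test(heights, weights)
cor_result$estimate  # the correlation coefficient, same value as cor()
cor_result$p.value   # a small p-value means the correlation is significant
```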
3.6 Regression Analysis
Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. The simplest form of regression is linear regression, which models the relationship using a straight line.
Example: Performing Linear Regression
# Sample data: Heights and weights of a group of people
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
weights <- c(55, 65, 59, 73, 78, 63, 70, 58, 54, 64)
# Create a data frame for regression analysis
df <- data.frame(heights, weights)
# Perform linear regression
model <- lm(weights ~ heights, data = df)
# Print the summary of the regression model
summary(model)
# Output:
#
# Call:
# lm(formula = weights ~ heights, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.6412 -0.7270 0.6773 1.0280 1.8488
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -113.59136 14.16134 -8.021 4.28e-05 ***
# heights 1.04900 0.08363 12.544 1.53e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.835 on 8 degrees of freedom
# Multiple R-squared: 0.9516, Adjusted R-squared: 0.9456
# F-statistic: 157.4 on 1 and 8 DF, p-value: 1.528e-06
In this example, we perform a simple linear regression to model the relationship between heights (independent variable) and weights (dependent variable). The lm() function fits the linear model, and the summary() function provides detailed information about the regression, including the coefficients, standard errors, and p-values.
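Once a model has been fitted with lm(), it can be used to predict new values with predict(). A sketch, assuming the model above and a hypothetical new height of 175 cm:

```r
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
weights <- c(55, 65, 59, 73, 78, 63, 70, 58, 54, 64)
df <- data.frame(heights, weights)
model <- lm(weights ~ heights, data = df)

# Predict the weight for a new, hypothetical height of 175 cm.
# The column name in newdata must match the predictor name in the formula.
new_data <- data.frame(heights = 175)
predict(model, newdata = new_data)
# Roughly -113.59 + 1.049 * 175, i.e. about 69.98
```

This is how a fitted regression model is put to practical use: the coefficients from the summary become a formula for estimating the dependent variable at new values of the predictor.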