For R Beginners, Lesson 3: "Basic Statistical Analysis"
3. Basic Statistical Analysis in R
In this section, we will cover some basic statistical analyses that you can perform using R. These analyses include descriptive statistics, hypothesis testing, correlation analysis, and regression analysis. Each of these analyses helps us understand our data better and make informed decisions based on the data.
3.1 Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. Common descriptive statistics include measures of central tendency (mean, median, mode) and measures of spread (range, variance, standard deviation). Note that base R has no built-in function for the statistical mode; the mode() function in R returns the storage type of an object, not the most frequent value.
Example: Calculating Descriptive Statistics
# Sample data: Heights of a group of people
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
# Mean (average) of heights
mean_height <- mean(heights)
print(mean_height) # Output: [1] 169.2
# Median of heights
median_height <- median(heights)
print(median_height) # Output: [1] 169.5
# Standard deviation of heights
sd_height <- sd(heights)
print(sd_height) # Output: [1] 7.315129
# Variance of heights
var_height <- var(heights)
print(var_height) # Output: [1] 53.51111
# Minimum and maximum heights
min_height <- min(heights)
print(min_height) # Output: [1] 158
max_height <- max(heights)
print(max_height) # Output: [1] 181
In this example, we calculate various descriptive statistics for a vector of heights. These statistics give us a summary of the data, such as the average height, the middle value (median), the amount of variability in heights (standard deviation and variance), and the range of heights.
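Instead of calling each function separately, base R also provides summary(), which reports several of these statistics in one call. A quick sketch using the same heights vector:

```r
# Heights from the example above
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)

# summary() reports the minimum, quartiles, median, mean, and maximum at once
summary(heights)
# Output:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   158.0   164.2   169.5   169.2   173.5   181.0
```

This is a convenient first look at any numeric vector before computing individual statistics.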
3.2 Handling Missing Data
When working with datasets, you often encounter missing values (NA). It's important to handle these missing values correctly to avoid inaccuracies in your analysis.
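One reason missing values matter is that most basic statistical functions return NA if the input contains any NA at all; the na.rm = TRUE argument tells them to ignore missing values. A small sketch:

```r
# A vector with one missing value
ages <- c(25, NA, 30, 22)

# By default, the NA propagates through the calculation
mean(ages)                 # Output: [1] NA

# na.rm = TRUE drops the NA before computing
mean(ages, na.rm = TRUE)   # Output: [1] 25.66667
```

The same na.rm argument works for sum(), sd(), var(), min(), and max().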
Removing Rows with Missing Data
To remove rows that contain NA values, use the na.omit() function, or use logical indexing with is.na() when you only care about specific columns.
Example Data Frame with Missing Data
# Creating a data frame with some missing values
my_data <- data.frame(
Name = c("Alice", "Bob", NA, "David"),
Age = c(25, NA, 30, 22),
Salary = c(50000, 45000, 60000, NA)
)
print(my_data)
# Output:
# Name Age Salary
# 1 Alice 25 50000
# 2 Bob NA 45000
# 3 <NA> 30 60000
# 4 David 22 NA
Removing Rows with NA Values
To remove all rows containing any NA values, use the na.omit() function:
# Removing rows with any NA values
cleaned_data <- na.omit(my_data)
print(cleaned_data)
# Output:
# Name Age Salary
# 1 Alice 25 50000
Removing Rows Based on Specific Conditions
You can also remove rows where specific columns contain NA values using logical indexing.
The ! operator in R means "not" and is used to reverse a logical condition. For example, !is.na(my_data$Age) selects rows where the Age column is not NA.
# Removing rows where the 'Age' column has NA values
cleaned_data <- my_data[!is.na(my_data$Age), ]
print(cleaned_data)
# Output:
# Name Age Salary
# 1 Alice 25 50000
# 3 <NA> 30 60000
# 4 David 22 NA
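An alternative to na.omit() is the base R function complete.cases(), which returns a logical vector marking the rows that have no missing values in any column; combined with indexing, it gives the same result as na.omit(). A sketch using the same my_data frame:

```r
# The same data frame with missing values as above
my_data <- data.frame(
  Name = c("Alice", "Bob", NA, "David"),
  Age = c(25, NA, 30, 22),
  Salary = c(50000, 45000, 60000, NA)
)

# TRUE only for rows with no NA in any column
complete.cases(my_data)  # Output: [1] TRUE FALSE FALSE FALSE

# Keep only the complete rows (same result as na.omit(my_data))
cleaned_data <- my_data[complete.cases(my_data), ]
print(cleaned_data)
# Output:
#    Name Age Salary
# 1 Alice  25  50000
```

complete.cases() is handy when you want the logical vector itself, for example to count or inspect the incomplete rows before deleting them.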
3.3 Combining Data Frames with rbind()
The rbind() function in R combines two or more data frames by rows. It is particularly useful when you have datasets with the same column names and want to stack them on top of each other; if the column names differ, rbind() raises an error.
Example: Combining Data Frames
# Creating two data frames
data1 <- data.frame(
Name = c("Alice", "Bob"),
Age = c(25, 30)
)
data2 <- data.frame(
Name = c("Charlie", "David"),
Age = c(35, 40)
)
# Combining data frames by rows
combined_data <- rbind(data1, data2)
print(combined_data)
# Output:
# Name Age
# 1 Alice 25
# 2 Bob 30
# 3 Charlie 35
# 4 David 40
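The column-wise counterpart of rbind() is cbind(), which attaches new columns to rows that are already aligned. A small sketch continuing the example, with a hypothetical Salary column:

```r
data1 <- data.frame(
  Name = c("Alice", "Bob"),
  Age = c(25, 30)
)

# A hypothetical extra column for the same two people, in the same order
salaries <- data.frame(Salary = c(50000, 45000))

# Combining data frames by columns
combined_cols <- cbind(data1, salaries)
print(combined_cols)
# Output:
#    Name Age Salary
# 1 Alice  25  50000
# 2   Bob  30  45000
```

With cbind(), it is your responsibility to make sure the rows of both data frames refer to the same observations in the same order.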
By mastering these basic data handling techniques, including removing missing data and combining datasets, you can ensure your data is clean and ready for analysis, which is crucial for accurate statistical computations.
3.4 Hypothesis Testing
Hypothesis testing is a method used to determine if there is enough evidence in a sample of data to infer that a certain condition is true for the entire population. A common hypothesis test is the t-test, which compares the means of two groups to see if they are significantly different.
Example: Performing a t-test
# Sample data: Weights of two different groups
group1 <- c(55, 60, 65, 58, 62)
group2 <- c(68, 70, 72, 75, 69)
# Perform a two-sample t-test
t_test_result <- t.test(group1, group2)
# Print the t-test result
print(t_test_result)
# Output:
#
# Welch Two Sample t-test
#
# data: group1 and group2
# t = -5.1255, df = 7.3138, p-value = 0.001189
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -15.739562 -5.860438
# sample estimates:
# mean of x mean of y
# 60.0 70.8
In this example, we use a two-sample t-test to compare the means of two groups (group1 and group2). The t.test() function calculates the t-statistic and the p-value. A small p-value (typically less than 0.05) indicates that there is strong evidence against the null hypothesis, suggesting that the means of the two groups are significantly different.
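The object returned by t.test() is a list, so individual components such as the p-value or confidence interval can be pulled out with $ rather than read off the printout. A sketch using the same two groups:

```r
group1 <- c(55, 60, 65, 58, 62)
group2 <- c(68, 70, 72, 75, 69)
t_test_result <- t.test(group1, group2)

# Extract individual components from the result
t_test_result$p.value    # matches "p-value = 0.001189" in the printout above
t_test_result$conf.int   # the 95 percent confidence interval
t_test_result$estimate   # the two sample means (60.0 and 70.8)
```

Extracting components this way is useful when the p-value feeds into further code, for example an automated check like p < 0.05.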
3.5 Correlation Analysis
Correlation analysis measures the strength and direction of the linear relationship between two variables. The most common measure of correlation is Pearson's correlation coefficient.
Example: Calculating Pearson Correlation
# Sample data: Heights and weights of a group of people
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
weights <- c(55, 65, 59, 73, 78, 63, 70, 58, 54, 64)
# Calculate Pearson correlation coefficient
correlation <- cor(heights, weights)
print(correlation) # Output: [1] 0.9755095
In this example, we calculate the Pearson correlation coefficient between two variables: heights and weights. The cor() function computes the correlation, and the result (approximately 0.98) indicates a strong positive linear relationship between height and weight.
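cor() returns only the coefficient. To test whether the correlation is statistically significant, base R provides cor.test(), which also reports a p-value and a confidence interval. A sketch with the same data:

```r
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
weights <- c(55, 65, 59, 73, 78, 63, 70, 58, 54, 64)

# Pearson correlation with a significance test
cor_result <- cor.test(heights, weights)
cor_result$estimate  # the correlation coefficient, same value as cor()
cor_result$p.value   # a small p-value means the correlation is significant
```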
3.6 Regression Analysis
Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. The simplest form of regression is linear regression, which models the relationship using a straight line.
Example: Performing Linear Regression
# Sample data: Heights and weights of a group of people
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
weights <- c(55, 65, 59, 73, 78, 63, 70, 58, 54, 64)
# Create a data frame for regression analysis
df <- data.frame(heights, weights)
# Perform linear regression
model <- lm(weights ~ heights, data = df)
# Print the summary of the regression model
summary(model)
# Output:
#
# Call:
# lm(formula = weights ~ heights, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.6412 -0.7270 0.6773 1.0280 1.8488
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -113.59136 14.16134 -8.021 4.28e-05 ***
# heights 1.04900 0.08363 12.544 1.53e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.835 on 8 degrees of freedom
# Multiple R-squared: 0.9516, Adjusted R-squared: 0.9456
# F-statistic: 157.4 on 1 and 8 DF, p-value: 1.528e-06
In this example, we perform a simple linear regression to model the relationship between heights (independent variable) and weights (dependent variable). The lm() function fits the linear model, and the summary() function provides detailed information about the regression, including the coefficients, standard errors, and p-values.
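Once a model has been fitted with lm(), it can be used to predict new values with predict(). A sketch, assuming the model above and a hypothetical new height of 175 cm:

```r
heights <- c(160, 172, 168, 177, 181, 169, 174, 163, 158, 170)
weights <- c(55, 65, 59, 73, 78, 63, 70, 58, 54, 64)
df <- data.frame(heights, weights)
model <- lm(weights ~ heights, data = df)

# Predict the weight for a new, hypothetical height of 175 cm.
# The column name in newdata must match the predictor name in the formula.
new_data <- data.frame(heights = 175)
predict(model, newdata = new_data)
# Roughly -113.59 + 1.049 * 175, i.e. about 69.98
```

This is how a fitted regression model is put to practical use: the coefficients from the summary become a formula for estimating the dependent variable at new values of the predictor.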