For R Beginners, Lesson 2: "Data Manipulation"

August 25, 2024, 22:19

2. Data Manipulation

Data manipulation is a key part of data analysis. In R, the dplyr package provides a set of functions that make data manipulation easier and more intuitive. To use dplyr, you'll need to install it if you haven't already, and then load it into your R session.

2.1 Installing and Loading Libraries

Before using any functions from a package like dplyr, install the package once with install.packages(), then load it with library() in every new R session:

# Installing the dplyr package (only need to do this once)
install.packages("dplyr")

# Loading the dplyr package
library(dplyr)
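
If a script may run on a machine where dplyr is already installed, the installation step can be made conditional. The following is a minimal sketch, assuming a CRAN mirror is configured and an internet connection is available; requireNamespace() and packageVersion() are base R functions.

```r
# Install dplyr only if it is not already available
# (assumes a CRAN mirror is configured for this R installation)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package for the current session
library(dplyr)

# Show which version of dplyr is loaded
print(packageVersion("dplyr"))
```

This keeps the script safe to re-run, because install.packages() is only called when the package is missing.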

2.2 Selecting Columns

To select specific columns from a data frame, use the select() function:

# Sample data frame
my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 95)
)

# Selecting the 'Name' and 'Score' columns
selected_data <- select(my_data, Name, Score)
print(selected_data)

# Output:
#      Name Score
# 1   Alice    85
# 2     Bob    90
# 3 Charlie    95
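
select() also accepts a few other forms besides a plain list of column names. The sketch below reuses the same sample data; the variable names (without_age, s_columns, renamed) and the new column name Student are made up for illustration.

```r
library(dplyr)

# Same sample data frame as above
my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 95)
)

# Drop a column by prefixing its name with a minus sign
without_age <- select(my_data, -Age)

# Keep only columns whose names start with "S"
s_columns <- select(my_data, starts_with("S"))

# Select and rename in one step (new_name = old_name)
renamed <- select(my_data, Student = Name, Score)

print(names(without_age))  # Name, Score
print(names(s_columns))    # Score
print(names(renamed))      # Student, Score
```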

2.3 Filtering Rows

To filter rows based on specific conditions, use the filter() function:

# Filtering rows where Age is greater than 25
filtered_data <- filter(my_data, Age > 25)
print(filtered_data)

# Output:
#      Name Age Score
# 1     Bob  30    90
# 2 Charlie  35    95
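
filter() can also combine several conditions. As a small sketch on the same sample data (the thresholds and variable names are chosen only for illustration): conditions separated by commas must all be true, | means "or", and %in% tests membership in a set of values.

```r
library(dplyr)

# Same sample data frame as above
my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 95)
)

# Comma-separated conditions are combined with AND: only Bob matches
both <- filter(my_data, Age > 25, Score < 95)

# The | operator means OR: Alice and Charlie match
either <- filter(my_data, Age < 30 | Score >= 95)

# %in% keeps rows whose value appears in the given vector
named <- filter(my_data, Name %in% c("Alice", "Bob"))

print(both)
print(either)
print(named)
```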

2.4 Adding New Columns

To add new columns or modify existing ones, use the mutate() function:

# Adding a new column 'Passed' based on the 'Score'
mutated_data <- mutate(my_data, Passed = Score > 80)
print(mutated_data)

# Output:
#      Name Age Score Passed
# 1   Alice  25    85   TRUE
# 2     Bob  30    90   TRUE
# 3 Charlie  35    95   TRUE
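
mutate() can also overwrite an existing column, and later expressions in the same call see the updated values. The sketch below assumes a hypothetical 5-point bonus and a hypothetical grade cutoff of 95; if_else() is dplyr's vectorized if/else.

```r
library(dplyr)

# Same sample data frame as above
my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 95)
)

result <- mutate(
  my_data,
  Score = Score + 5,                      # overwrite Score: 90, 95, 100
  Grade = if_else(Score >= 95, "A", "B")  # uses the updated Score
)

print(result)
```

Because Grade is computed after Score has been updated, Alice gets "B" while Bob and Charlie get "A". The original my_data is unchanged, since mutate() returns a new data frame.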

2.5 Grouping and Summarizing Data

Grouping and summarizing data are common tasks when you want to analyze subsets of your data. In R, these tasks can be accomplished using the dplyr package, which provides functions like group_by() and summarize().

2.5.1 Grouping Data

Grouping data means organizing data into groups based on one or more variables. This is often a precursor to summarizing or aggregating data within each group.

Using group_by()

The group_by() function in dplyr is used to group data by one or more variables.

Here’s an example:

# Sample data frame
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Gender = c("Female", "Male", "Male", "Male", "Female"),
  Age = c(25, 30, 35, 40, 28),
  Height = c(165, 180, 175, 170, 160)
)

# Grouping by the 'Gender' column without using the pipe
grouped_data <- group_by(df, Gender)

print(grouped_data)

# Output (rows keep their original order; only the grouping is recorded):
# # A tibble: 5 × 4
# # Groups:   Gender [2]
#   Name    Gender   Age Height
#   <chr>   <chr>  <dbl>  <dbl>
# 1 Alice   Female    25    165
# 2 Bob     Male      30    180
# 3 Charlie Male      35    175
# 4 David   Male      40    170
# 5 Eve     Female    28    160

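In this example, group_by(df, Gender) groups the data by the Gender column. This does not change the values or the order of the rows. It only records the grouping (shown in the "# Groups:" line) and returns the data as a tibble, which prepares it for further operations like summarizing.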
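
To see what group_by() actually recorded, the grouping can be inspected and removed again. This is a small sketch using dplyr's n_groups(), group_vars(), group_size(), and ungroup() on the same sample data.

```r
library(dplyr)

# Same sample data frame as above
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Gender = c("Female", "Male", "Male", "Male", "Female"),
  Age = c(25, 30, 35, 40, 28),
  Height = c(165, 180, 175, 170, 160)
)

grouped_data <- group_by(df, Gender)

print(n_groups(grouped_data))    # number of groups: 2
print(group_vars(grouped_data))  # grouping column: "Gender"
print(group_size(grouped_data))  # rows per group: 2 (Female), 3 (Male)

# The rows themselves are untouched by grouping
print(identical(grouped_data$Name, df$Name))  # TRUE

# ungroup() removes the grouping when it is no longer needed
ungrouped <- ungroup(grouped_data)
print(is_grouped_df(ungrouped))  # FALSE
```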

2.5.2 Summarizing Data

Summarizing data involves calculating summary statistics (like mean, sum, etc.) for each group.

Using summarize()

The summarize() function in dplyr is used to create summary statistics for each group. When used with group_by(), it allows you to perform aggregations within each group.

Here’s how you can summarize data after grouping (the %>% pipe operator used here is explained in section 2.6):

# Summarizing the data by calculating the mean age and mean height for each gender
summary_data <- df %>%
  group_by(Gender) %>%
  summarize(
    mean_age = mean(Age),
    mean_height = mean(Height)
  )

print(summary_data)

# Output:
# # A tibble: 2 × 3
#   Gender mean_age mean_height
#   <chr>     <dbl>       <dbl>
# 1 Female     26.5        162.
# 2 Male       35          175
# (Tibbles print about three significant digits; the exact mean height for Female is 162.5.)

In this example, summarize() is used to calculate the mean age and mean height for each gender. The group_by(Gender) function specifies that the summarization should be done for each gender group.
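
summarize() is not limited to means. The sketch below, on the same sample data, counts the rows in each group with n() and adds a maximum and a minimum. The column names (count, max_height, min_age) are chosen for illustration, and .groups = "drop" (available in dplyr 1.0.0 and later) returns an ungrouped result.

```r
library(dplyr)

# Same sample data frame as above
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Gender = c("Female", "Male", "Male", "Male", "Female"),
  Age = c(25, 30, 35, 40, 28),
  Height = c(165, 180, 175, 170, 160)
)

group_stats <- df %>%
  group_by(Gender) %>%
  summarize(
    count = n(),               # number of rows in each group
    max_height = max(Height),  # tallest person in each group
    min_age = min(Age),        # youngest person in each group
    .groups = "drop"           # return a plain, ungrouped tibble
  )

print(group_stats)
```

For this data, the Female group has 2 rows (maximum height 165, minimum age 25) and the Male group has 3 rows (maximum height 180, minimum age 30).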

2.6 Piping in R

The %>% operator, known as the "pipe" operator, is a powerful tool in R for chaining multiple functions together. It becomes available when you load dplyr. It passes the output of one function directly into the next function as its first argument, so you do not need intermediate variables, which makes your code more readable and concise.

Without Using Pipes

Without using pipes, you either nest one function call inside another or, as shown here, create an intermediate variable at each step of your data manipulation:

# Sample data frame
grades <- data.frame(
  Student = c("Alice", "Bob", "Alice", "Bob", "Alice", "Bob"),
  Subject = c("Math", "Math", "Science", "Science", "History", "History"),
  Score = c(88, 92, 95, 85, 80, 78)
)

# Step 1: Group by 'Student'
grouped_data <- group_by(grades, Student)

# Step 2: Summarize the average score for each student
average_scores <- summarize(grouped_data, Average_Score = mean(Score))

print(average_scores)

# Output:
# # A tibble: 2 × 2
#   Student Average_Score
#   <chr>           <dbl>
# 1 Alice            87.7
# 2 Bob              85

In the code above, you first group the data by Student and store the result in grouped_data. Then, you use summarize() on grouped_data to calculate the average scores, storing the result in another variable average_scores.
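
Intermediate variables can also be avoided without pipes by nesting one call inside another, at the cost of code that has to be read from the inside out. A minimal sketch of that nested form, using the same grades data (the variable name average_scores_nested is made up for illustration):

```r
library(dplyr)

# Same sample data frame as above
grades <- data.frame(
  Student = c("Alice", "Bob", "Alice", "Bob", "Alice", "Bob"),
  Subject = c("Math", "Math", "Science", "Science", "History", "History"),
  Score = c(88, 92, 95, 85, 80, 78)
)

# Nested form: group_by() runs first, and its result is passed to summarize()
average_scores_nested <- summarize(
  group_by(grades, Student),
  Average_Score = mean(Score)
)

print(average_scores_nested)
```

With only two steps this is still manageable, but each additional step adds another layer of nesting.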

Using Pipes

Using pipes (%>%), you can accomplish the same task without creating intermediate variables:

# Using pipes to group by 'Student' and calculate the average score
average_scores <- grades %>%
  group_by(Student) %>%
  summarize(Average_Score = mean(Score))

print(average_scores)

# Output:
# # A tibble: 2 × 2
#   Student Average_Score
#   <chr>           <dbl>
# 1 Alice            87.7
# 2 Bob              85

The pipe operator %>% passes the result of one function directly to the next. The grades data frame is first passed to group_by(Student), and the output is then passed to summarize(Average_Score = mean(Score)). This eliminates the need for intermediate variables, making the code more compact and easier to read.
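
Pipes become more useful as the chain grows. The sketch below combines filter(), group_by(), and summarize() from the earlier sections in one pipeline. It also uses arrange() and desc(), two dplyr functions not covered above, to sort the result; the choice to exclude History is only for illustration.

```r
library(dplyr)

# Same sample data frame as above
grades <- data.frame(
  Student = c("Alice", "Bob", "Alice", "Bob", "Alice", "Bob"),
  Subject = c("Math", "Math", "Science", "Science", "History", "History"),
  Score = c(88, 92, 95, 85, 80, 78)
)

math_science <- grades %>%
  filter(Subject != "History") %>%           # keep Math and Science rows only
  group_by(Student) %>%                      # one group per student
  summarize(Average_Score = mean(Score)) %>% # average within each group
  arrange(desc(Average_Score))               # highest average first

print(math_science)
```

Each line reads as one step, from top to bottom. For this data the averages are 91.5 for Alice and 88.5 for Bob.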
