收藏文章楼主

【3】R学笔记：数据合并方法

版块：学术科研类型：普通作者：阿杰哥查看：233 回复：0 获赞：0 时间：2026-01-06 23:42:27

当你有多张表，它们有**相同的ID列**但**不同的特征列**时，需要合并它们。这是数据整合的常见需求。

以下是几种合并方法，我按常见程度排序：

## 一、**准备示例数据**

假设你有3张表：

```r

library(data.table)

# 表1：客户基本信息

customers <- data.table(

customer_id = c(101, 102, 103, 104, 105),

name = c("张三", "李四", "王五", "赵六", "钱七"),

age = c(25, 30, 28, 35, 32)

)

# 表2：客户消费信息

purchases <- data.table(

customer_id = c(101, 102, 103, 104, 106), # 注意：106不在customers中

total_amount = c(5000, 8000, 3000, 12000, 6000),

order_count = c(5, 8, 3, 12, 6)

)

# 表3：客户联系信息

contacts <- data.table(

customer_id = c(101, 102, 104, 105, 107), # 注意：103缺失，107不在customers中

phone = c("13800138001", "13800138002", "13800138004", "13800138005", "13800138007"),

email = c("zhang@test.com", "li@test.com", "zhao@test.com", "qian@test.com", "sun@test.com")

)

print("表1：客户基本信息")

print(customers)

print("\n表2：消费信息")

print(purchases)

print("\n表3：联系信息")

print(contacts)

```

## 二、**合并方法对比**

| 方法 | 功能 | 相当于SQL | 特点 |

|------|------|-----------|------|

| **cbind()** | 列合并 | - | 需要行顺序一致 |

## 三、**方法1：使用 merge()（最常用）**

### 1. **内连接**（只保留两边都有的ID）

```r

# 合并customers和purchases（只保留两个表都有的客户）

inner_merged <- merge(customers, purchases, by = "customer_id")

print("内连接结果（只保留两个表都有的ID）：")

print(inner_merged)

# 结果：101, 102, 103, 104（106被排除）

```

### 2. **左连接**（以左边表为基准）

```r

# 以customers为准，保留所有客户

left_merged <- merge(customers, purchases, by = "customer_id", all.x = TRUE)

print("\n左连接结果（保留customers所有行）：")

print(left_merged)

# 结果：101-105都有，purchases没有的显示NA

```

### 3. **全连接**（保留所有ID）

```r

# 保留所有表的ID

full_merged <- merge(customers, purchases, by = "customer_id", all = TRUE)

print("\n全连接结果（保留所有ID）：")

print(full_merged)

# 结果：101-106都有，缺失的显示NA

```

### 4. **合并多个表**

```r

# 方法1：逐步合并

merged_step1 <- merge(customers, purchases, by = "customer_id", all.x = TRUE)

merged_final <- merge(merged_step1, contacts, by = "customer_id", all.x = TRUE)

print("逐步合并结果：")

print(merged_final)

# 方法2：使用Reduce一次性合并多个表

tables_list <- list(customers, purchases, contacts)

multi_merged <- Reduce(function(x, y) merge(x, y, by = "customer_id", all = TRUE), tables_list)

print("\n使用Reduce合并多个表：")

print(multi_merged)

```

## 四、**方法2：使用 data.table 的 `[ ]` 语法（最快）**

### 先设置键（重要！）

```r

# 为每个表设置相同的键

setkey(customers, customer_id)

setkey(purchases, customer_id)

setkey(contacts, customer_id)

# 左连接：customers + purchases

dt_left <- purchases[customers] # 注意顺序！purchases在前表示以purchases为基准

print("purchases[customers] - 以purchases为基准：")

print(dt_left)

# 更常用的：customers在前（以customers为基准）

dt_left2 <- customers[purchases]

print("\ncustomers[purchases] - 以customers为基准：")

print(dt_left2)

# 实际中我们通常想要以customers为基准的左连接

# 正确的左连接（以customers为基准）

correct_left <- purchases[customers] # purchases在前，customers在后

print("\n正确的左连接（purchases在前）：")

print(correct_left)

```

### 记忆技巧

```r

# X[Y] 表示：以X为基准，从Y中获取匹配的数据

# 相当于SQL的：SELECT * FROM X LEFT JOIN Y ON X.id = Y.id

# 所以：

# 要以customers为基准：其他表[customers]

# 要以purchases为基准：其他表[purchases]

```

### 合并多个表

```r

# 合并三个表：以customers为基准

merged_dt <- contacts[purchases[customers]]

print("\ncontacts[purchases[customers]] - 链式合并：")

print(merged_dt)

# 或者分步更清晰

step1 <- purchases[customers] # customers + purchases

step2 <- contacts[step1] # 加上contacts

print("\n分步合并：")

print(step2)

```

## 五、**方法3：使用 cbind()（仅当行顺序完全一致时）**

```r

# 注意：这要求行数相同且顺序一致！不推荐用于按ID合并

# 创建顺序一致的数据

dt1 <- data.table(id = 1:5, value1 = letters[1:5])

dt2 <- data.table(id = 1:5, value2 = 11:15) # id顺序相同

# 直接列合并

cbind_result <- cbind(dt1, dt2[, -"id"]) # 去掉dt2的id列避免重复

print("cbind合并（顺序一致时）：")

print(cbind_result)

```

## 六、**处理实际问题：你的情况**

假设你有：

- 表A：`id, name, age`

- 表B：`id, income, education`

- 表C：`id, address, phone`

### 推荐做法：

```r

library(data.table)

# 1. 读取你的数据（替换为实际文件路径）

# table1 <- fread("表1.csv")

# table2 <- fread("表2.csv")

# table3 <- fread("表3.csv")

# 2. 为演示创建示例数据

table1 <- data.table(

id = c(1, 2, 3, 4, 5),

name = c("张三", "李四", "王五", "赵六", "钱七"),

age = c(25, 30, 28, 35, 32)

)

table2 <- data.table(

id = c(1, 2, 3, 6, 7), # 注意：4,5缺失，6,7新增

income = c(5000, 8000, 6000, 9000, 7000),

education = c("本科", "硕士", "博士", "本科", "硕士")

)

table3 <- data.table(

id = c(1, 3, 4, 5, 8), # 注意：2缺失，8新增

address = c("北京", "上海", "广州", "深圳", "杭州"),

phone = c("13800138001", "13800138003", "13800138004", "13800138005", "13800138008")

)

print("你的三个表：")

print(table1)

print(table2)

print(table3)

# 3. 方法A：使用merge（最直观）

final_table <- merge(table1, table2, by = "id", all.x = TRUE)

final_table <- merge(final_table, table3, by = "id", all.x = TRUE)

print("\n方法A：使用merge（左连接，以table1为基准）")

print(final_table)

# 4. 方法B：使用data.table语法（更高效）

setkey(table1, id)

setkey(table2, id)

setkey(table3, id)

final_table_dt <- table3[table2[table1]]

print("\n方法B：使用data.table [ ] 语法")

print(final_table_dt)

# 5. 方法C：保留所有ID（全连接）

all_ids <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE),

list(table1, table2, table3))

print("\n方法C：全连接（保留所有ID）")

print(all_ids)

```

## 七、**处理重复ID和列名冲突**

### 1. 有重复ID的情况

```r

# 表中有重复ID

dup_table1 <- data.table(

id = c(1, 1, 2, 3), # id=1重复

value = c("A", "B", "C", "D")

)

dup_table2 <- data.table(

id = c(1, 2, 2, 3), # id=2重复

score = c(90, 85, 88, 92)

)

# 合并时会自动处理所有组合

merged_dup <- merge(dup_table1, dup_table2, by = "id")

print("有重复ID的合并结果：")

print(merged_dup)

# id=1：A-90 和 B-90

# id=2：C-85 和 C-88

```

### 2. 列名冲突处理

```r

# 两个表有相同列名（非ID列）

conflict1 <- data.table(

id = 1:3,

value = c("X", "Y", "Z"),

common_col = c(10, 20, 30) # 冲突列

)

conflict2 <- data.table(

id = 2:4,

value = c("A", "B", "C"),

common_col = c(200, 300, 400) # 冲突列

)

# 合并时自动添加后缀

merged_conflict <- merge(conflict1, conflict2, by = "id",

suffixes = c("_table1", "_table2"))

print("\n列名冲突处理（自动加后缀）：")

print(merged_conflict)

```

## 八、**实用函数封装**

```r

# 封装一个合并多个表的函数

merge_multiple_tables <- function(table_list, id_col = "id", join_type = "left") {

# table_list: 表列表

# id_col: ID列名

# join_type: "left", "inner", "full"

if (length(table_list) < 2) {

return(table_list[[1]])

}

# 确定合并参数

if (join_type == "left") {

all_arg <- TRUE

all.x_arg <- TRUE

all.y_arg <- FALSE

} else if (join_type == "inner") {

all_arg <- FALSE

all.x_arg <- FALSE

all.y_arg <- FALSE

} else if (join_type == "full") {

all_arg <- TRUE

all.x_arg <- TRUE

all.y_arg <- TRUE

}

# 使用Reduce合并

result <- Reduce(function(x, y) {

merge(x, y, by = id_col, all = all_arg, all.x = all.x_arg, all.y = all.y_arg)

}, table_list)

return(result)

}

# 使用示例

tables <- list(table1, table2, table3)

merged_result <- merge_multiple_tables(tables, id_col = "id", join_type = "left")

print("使用封装函数合并：")

print(merged_result)

```

## 九、**选择建议**

1. **初学者**：用 `merge()`，语法直观，类似SQL

2. **大数据**：用 `data.table` 的 `[ ]` 语法，速度最快

3. **简单情况**：行顺序一致时可用 `cbind()`

4. **推荐流程**：

```r

# 1. 确定以哪个表为基准（通常是最全的维度表）

# 2. 设置键：setkey(dt, id)

# 3. 使用左连接：base_table[other_table]

# 4. 或使用merge：merge(base_table, other_table, by="id", all.x=TRUE)

```

## **你的具体问题**

如果你告诉我：

1. 你有几个表？

2. 每个表大概有多少行？

3. ID列的名称是否相同？

4. 你想保留所有ID还是只保留共有的？

我可以给你更具体的代码示例！

梦一开始的时候就没那么远

回复列表

默认热门正序倒序

首 1 尾

暂无用户组

退出

等级：0级

积分：

游客：

后台控制面板