# **Data.table 完全指南**
`data.table` 是R中处理表格数据的高性能包,语法简洁且速度极快。特别适合大数据处理。
## 一、**安装与基础**
### 1. 安装与加载
```r
install.packages("data.table")
library(data.table)
```
### 2. 创建 data.table
```r
# 方法1:从data.frame转换
DT <- as.data.table(mtcars)
# 方法2:直接创建
DT <- data.table(
id = 1:5,
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
score = c(85, 92, 78, 88, 95),
group = c("A", "B", "A", "B", "A")
)
# 方法3:使用 .() 语法(更data.table风格)
DT <- data.table(
id = 1:5,
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
score = c(85, 92, 78, 88, 95),
group = c("A", "B", "A", "B", "A")
)
```
## 二、**核心语法:DT[i, j, by]**
这是data.table的核心结构,对应SQL的:
- **i** = WHERE (行选择)
- **j** = SELECT (列操作)
- **by** = GROUP BY (分组)
### 基础示例
```r
# 创建示例数据
set.seed(123)
sales <- data.table(
order_id = 1:100,
customer_id = sample(1:20, 100, replace = TRUE),
product = sample(c("A", "B", "C"), 100, replace = TRUE),
amount = round(runif(100, 10, 500), 2),
date = as.Date("2023-01-01") + sample(0:365, 100, replace = TRUE)
)
# 查看数据
sales[]
print(sales)
str(sales)
```
## 三、**行选择 (i)**
### 1. 按条件筛选
```r
# 选择金额大于300的行
sales[amount > 300]
# 多条件筛选
sales[amount > 300 & product == "A"]
# 选择特定行号
sales[1:10] # 前10行
sales[c(1, 3, 5)] # 第1,3,5行
```
### 2. 使用 %in% 筛选
```r
# 选择产品A或B
sales[product %in% c("A", "B")]
# 选择特定客户
selected_customers <- c(5, 10, 15)
sales[customer_id %in% selected_customers]
```
### 3. 使用 .I 获取行号
```r
# 获取满足条件的行号
row_nums <- sales[amount > 400, .I]
print(row_nums)
# 直接获取行号
sales[amount > 400, which = TRUE]
```
## 四、**列操作 (j)**
### 1. 选择列
```r
# 选择单列(返回向量)
sales[, amount]
# 选择多列(返回data.table)
sales[, .(amount, product)] # .() 是 list() 的简写
sales[, c("amount", "product")] # 使用字符串
sales[, list(amount, product)] # 使用list()
# 选择列并重命名
sales[, .(销售额 = amount, 产品 = product)]
```
### 2. 计算新列
```r
# 创建新列
sales[, discount_amount := amount * 0.9] # 添加新列(原地修改)
sales[, c("tax", "total") := .(amount * 0.1, amount * 1.1)] # 添加多列
# 条件赋值
sales[, category := ifelse(amount > 300, "高价", "低价")]
# 使用 := 但不显示结果(静默添加)
invisible(sales[, discount := amount * 0.8])
```
### 3. 聚合计算
```r
# 简单聚合
sales[, .(平均金额 = mean(amount))]
sales[, .(总金额 = sum(amount), 订单数 = .N)] # .N 计数
# 多指标聚合
sales[, .(
总金额 = sum(amount),
平均金额 = mean(amount),
最大金额 = max(amount),
最小金额 = min(amount),
订单数 = .N,
客户数 = uniqueN(customer_id) # 去重计数
)]
```
## 五、**分组操作 (by)**
### 1. 基本分组
```r
# 按产品分组汇总
sales[, .(总金额 = sum(amount), 订单数 = .N), by = product]
# 按多列分组
sales[, .(总金额 = sum(amount)), by = .(product, category)]
# 分组后排序
sales[, .(总金额 = sum(amount)), by = product][order(-总金额)]
```
### 2. 分组后添加列
```r
# 计算每个产品的占比
total_sales <- sales[, sum(amount)]
sales[, 占比 := amount / total_sales * 100]
# 按产品计算占比
sales[, 产品内占比 := amount / sum(amount) * 100, by = product]
```
### 3. 分组键(keyby)
```r
# keyby 会自动排序结果
sales[, .(总金额 = sum(amount)), keyby = product]
# 多列排序
sales[, .(总金额 = sum(amount)), keyby = .(product, category)]
```
## 六、**高级功能**
### 1. 设置键(setkey) - 极大提高速度
```r
# 设置键(类似数据库索引)
setkey(sales, customer_id) # 单键
setkey(sales, customer_id, date) # 复合键
# 使用键进行快速筛选
sales[.(5)] # 快速获取客户5的所有订单
sales[.(c(5, 10))] # 获取客户5和10的订单
sales[.(5, "2023-06-01")] # 使用复合键
# 检查键
key(sales)
```
### 2. 连接(join)操作
```r
# 创建客户表
customers <- data.table(
customer_id = 1:20,
name = paste0("客户", 1:20),
region = sample(c("North", "South", "East", "West"), 20, replace = TRUE)
)
# 设置键
setkey(sales, customer_id)
setkey(customers, customer_id)
# 内连接
result <- sales[customers, nomatch = 0]
# 左连接(默认)
result <- customers[sales]
# 右连接
result <- sales[customers]
# 全连接
result <- merge(sales, customers, all = TRUE)
```
### 3. 滚动连接(rolling join)
```r
# 创建时间序列数据
prices <- data.table(
date = as.Date(c("2023-01-01", "2023-01-05", "2023-01-10")),
price = c(100, 102, 105)
)
transactions <- data.table(
date = as.Date(c("2023-01-02", "2023-01-06", "2023-01-08")),
amount = c(10, 20, 15)
)
setkey(prices, date)
setkey(transactions, date)
# 向前滚动:使用之前最近的价格
transactions[prices, roll = TRUE]
```
## 七、**实用技巧**
### 1. 链式操作(chaining)
```r
# 使用 %>% 或 .[]
sales[
amount > 100
][
, .(总金额 = sum(amount)),
by = product
][
总金额 > 1000
][
order(-总金额)
]
```
### 2. 使用 .SD 和 .SDcols
```r
# .SD 代表每个分组的数据子集
# 对每个产品计算统计量
sales[, lapply(.SD, mean), by = product, .SDcols = "amount"]
# 对数值列进行标准化
numeric_cols <- c("amount", "discount_amount")
sales[, (numeric_cols) := lapply(.SD, scale), .SDcols = numeric_cols]
# 对所有数值列求总和
sales[, lapply(.SD, sum), .SDcols = is.numeric]
```
### 3. 快速筛选前N行
```r
# 每个产品金额最高的3个订单
sales[, .SD[order(-amount)][1:3], by = product]
# 使用 frank 快速排名
sales[, rank := frank(-amount, ties.method = "dense"), by = product]
sales[rank <= 3]
```
### 4. 数据重塑
```r
# 宽转长(melt)
long_dt <- melt(sales,
id.vars = c("order_id", "customer_id"),
measure.vars = c("amount", "discount_amount"),
variable.name = "指标",
value.name = "数值")
# 长转宽(dcast)
wide_dt <- dcast(long_dt,
order_id + customer_id ~ 指标,
value.var = "数值")
```
## 八、**性能对比示例**
```r
library(data.table)
library(dplyr)
# 创建大数据集
set.seed(123)
n <- 1e7 # 1000万行
big_dt <- data.table(
id = 1:n,
group = sample(letters, n, replace = TRUE),
value1 = rnorm(n),
value2 = runif(n)
)
big_df <- as.data.frame(big_dt)
# 性能测试
system.time({
result_dt <- big_dt[
value1 > 0,
.(mean_val = mean(value2), count = .N),
by = group
][order(-mean_val)]
})
system.time({
result_df <- big_df %>%
filter(value1 > 0) %>%
group_by(group) %>%
summarise(
mean_val = mean(value2),
count = n()
) %>%
arrange(desc(mean_val))
})
```
## 九、**常见任务示例**
### 1. 数据清洗
```r
# 处理缺失值
sales[is.na(amount), amount := 0]
# 删除重复行
unique(sales, by = c("customer_id", "date"))
# 类型转换
sales[, date := as.IDate(date)] # 更高效的数据类型
```
### 2. 时间序列分析
```r
# 按月统计
sales[, month := format(date, "%Y-%m")]
monthly_sales <- sales[, .(
总金额 = sum(amount),
订单数 = .N
), keyby = month]
# 计算环比增长
monthly_sales[, 环比增长 := 总金额 / shift(总金额, 1) - 1]
```
### 3. 创建汇总报表
```r
# 透视表
pivot_table <- dcast(
sales,
product ~ category,
value.var = "amount",
fun.aggregate = sum,
fill = 0
)
# 添加总计
pivot_table[, 总计 := rowSums(.SD), .SDcols = 2:ncol(pivot_table)]
```
## 十、**最佳实践**
1. **尽量使用 :=** 进行原地修改,避免内存复制
2. **设置键** 用于频繁的筛选和连接
3. **使用 .N, .I, .SD** 等特殊符号
4. **链式操作** 保持代码简洁
5. **注意数据类型**,如用 `IDate` 代替 `Date`
## **学习路径建议**
1. 先掌握 `DT[i, j, by]` 基本结构
2. 练习行筛选、列操作和分组
3. 学习设置键和连接操作
4. 掌握高级功能如 `.SD`、滚动连接
5. 在实际项目中应用
**练习**:尝试用data.table重写你之前用dplyr写的代码,比较两者的语法差异和性能!