Chapter 9 Data Preparation

Chapter 9.2-9.4: data manipulation with functions from the dplyr package, which is bundled with tidyverse
Chapter 9.5: building a descriptive statistics table with the gtsummary package (Sjoberg et al. 2022)

9.1 Packages & Data

library(tidyverse) # data preparation

library(AER) # example data

library(gtsummary) # descriptive statistics

data("NMES1988") ## load the data

raw <- NMES1988 ## assign it to an object named raw

9.2 Creating new variables

Use the mutate function

df <- 
  raw |> 
  mutate(age_2 = age^2) # create the squared term of age
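
A single mutate call can create several variables at once. The following is a minimal sketch on a hypothetical toy data frame (the object toy and its values are made up for illustration and are not taken from NMES1988); the new variable names age_income and high_income are likewise assumptions.

```r
library(dplyr)

# Hypothetical toy data standing in for raw (values are made up)
toy <- data.frame(
  age    = c(6.9, 7.4, 6.6),
  income = c(2.88, 2.75, 0.65)
)

toy_new <- 
  toy |> 
  mutate(age_2       = age^2,        # squared term, as in the example above
         age_income  = age * income, # interaction term
         high_income = income > 1)   # logical indicator
```

The original columns are kept, and the three new columns are appended to the right.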

9.3 Selecting variables

Use the select function

df <- 
  raw |> 
  select(age,
         income)

Excluding specific variables

df <- 
  raw |> 
  select(-age,
         -income)
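
Besides listing names one by one, select also accepts column ranges, type-based helpers, and renaming. A minimal sketch on a hypothetical toy data frame (toy and its values are made up; only the column names mirror NMES1988):

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  visits = c(5L, 1L, 13L),
  age    = c(6.9, 7.4, 6.6),
  income = c(2.88, 2.75, 0.65),
  health = factor(c("average", "average", "poor"))
)

toy_range   <- toy |> select(visits:income)              # contiguous range of columns
toy_numeric <- toy |> select(where(is.numeric))          # all numeric columns
toy_renamed <- toy |> select(n_visits = visits, health)  # select and rename in one step
```

In this toy example toy_range and toy_numeric contain the same three columns, because health is the only non-numeric variable.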

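9.4 Excluding observations

Use the filter function to keep only the observations that satisfy a condition (here, visits of 7 or more)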

df <- 
  raw |> 
  filter(visits >= 7)
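
Conditions can be combined inside filter. A minimal sketch on a hypothetical toy data frame (toy and its values are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  visits    = c(5L, 1L, 13L, 9L),
  health    = factor(c("average", "poor", "poor", "excellent")),
  insurance = factor(c("yes", "no", "yes", "no"))
)

# Conditions separated by a comma are combined with AND
toy_and <- toy |> filter(visits >= 7, health == "poor")

# | combines conditions with OR
toy_or <- toy |> filter(visits >= 7 | insurance == "no")

# %in% keeps rows whose value matches any element of a set
toy_in <- toy |> filter(health %in% c("poor", "average"))
```

Rows for which a condition evaluates to NA are dropped by filter, so missing values need explicit handling (for example with is.na) if they should be kept.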

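9.5 Creating a descriptive statistics table

Many useful packages exist for producing descriptive statistics
gtsummary is used here
Use select to extract the required variables (visits, health, medicaid, and the grouping variable insurance), then report, for each insurance group, the median and interquartile range for continuous variables and the count and percentage for categorical variables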

raw |>  # take raw as input
  select(visits,
         health,
         medicaid,
         insurance
         ) |> # extract the required variables
  tbl_summary(by = insurance) # compute descriptive statistics

Characteristic	no, N = 985¹	yes, N = 3,421¹
visits	3 (1, 7)	4 (2, 8)
health
poor	204 (21%)	350 (10%)
average	721 (73%)	2,788 (81%)
excellent	60 (6.1%)	283 (8.3%)
medicaid	341 (35%)	61 (1.8%)
¹ Median (IQR); n (%)

Report the mean and standard deviation for continuous variables

raw |> 
  select(visits,
         health,
         medicaid,
         insurance
         ) |> 
  tbl_summary(by = insurance,
            statistic = list(all_continuous() ~ "{mean} ({sd})") # show the mean and standard deviation
            )

Characteristic	no, N = 985¹	yes, N = 3,421¹
visits	5 (6)	6 (7)
health
poor	204 (21%)	350 (10%)
average	721 (73%)	2,788 (81%)
excellent	60 (6.1%)	283 (8.3%)
medicaid	341 (35%)	61 (1.8%)
¹ Mean (SD); n (%)
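
The continuous-variable summaries reported by tbl_summary can be cross-checked by hand with group_by and summarise. The sketch below uses a hypothetical toy data frame (toy and its values are made up) and also computes the standard error, to make explicit that {sd} is the standard deviation and that the standard error is the standard deviation divided by the square root of the group size. The column names in the result (n_obs, med, and so on) are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  visits    = c(2, 4, 6, 1, 3, 5, 7, 9),
  insurance = factor(c("no", "no", "no", "yes", "yes", "yes", "yes", "yes"))
)

toy_stats <- 
  toy |> 
  group_by(insurance) |> 
  summarise(n_obs       = n(),                    # group size
            med         = median(visits),         # median
            q1          = quantile(visits, 0.25), # first quartile
            q3          = quantile(visits, 0.75), # third quartile
            mean_visits = mean(visits),           # mean
            sd_visits   = sd(visits),             # standard deviation
            se_visits   = sd_visits / sqrt(n_obs)) # standard error of the mean
```

Replacing toy with raw gives the corresponding figures for the visits row of the tables above, up to rounding.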

References

Sjoberg, Daniel D., Michael Curry, Joseph Larmarange, Jessica Lavery, Karissa Whiting, and Emily C. Zabor. 2022. Gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. https://CRAN.R-project.org/package=gtsummary.