Tidyverse

R for Data Science by Wickham & Grolemund

Author

Sungkyun Cho

Published

March 16, 2024

Inspecting data

Functions: print(), glimpse(), summary(), count()
Whatever goes inside the parentheses () is called an argument

library(tidyverse)

cps <- as_tibble(mosaicData::CPS85) # CPS85 dataset from the mosaicData package
print(cps) # print() can be omitted
# A tibble: 534 x 11
   wage  educ race  sex   hispanic south married exper union   age sector  
  <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct>   
1   9      10 W     M     NH       NS    Married    27 Not      43 const   
2   5.5    12 W     M     NH       NS    Married    20 Not      38 sales   
3   3.8    12 W     F     NH       NS    Single      4 Not      22 sales   
4  10.5    12 W     F     NH       NS    Married    29 Not      47 clerical
5  15      12 W     M     NH       NS    Married    40 Union    58 const   
6   9      16 W     F     NH       NS    Married    27 Not      49 clerical
# i 528 more rows
print()

The lecture notes call print() only because of how Jupyter notebooks display data frames, so you can ignore it.

A data frame is usually inspected without print(), but print() lets you control how it is displayed.

print(cps, n = 3) # first 3 rows
# A tibble: 534 × 11
   wage  educ race  sex   hispanic south married exper union   age sector
  <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct> 
1   9      10 W     M     NH       NS    Married    27 Not      43 const 
2   5.5    12 W     M     NH       NS    Married    20 Not      38 sales 
3   3.8    12 W     F     NH       NS    Single      4 Not      22 sales 
# … with 531 more rows

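print(cps, n = 10, width = Inf) # 10 rows and all columns

To change the default settings:
options(tibble.print_min = 10, tibble.width = Inf)

When there are many columns/variables, the output is abbreviated to fit the screen, as shown below. To see all of them, set width = Inf.

print(nycflights13::flights) # flights data from the nycflights13 package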
# # A tibble: 336,776 × 19
#    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
# 1  2013     1     1      517         515       2     830     819      11 UA     
# 2  2013     1     1      533         529       4     850     830      20 UA     
# 3  2013     1     1      542         540       2     923     850      33 AA     
# 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
# 5  2013     1     1      554         600      -6     812     837     -25 DL     
# 6  2013     1     1      554         558      -4     740     728      12 UA     
# # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
# #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
# #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

print(nycflights13::flights, n = 3, width = Inf) # width = Inf: print all columns
# # A tibble: 336,776 × 19
#    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
# 1  2013     1     1      517            515         2      830            819
# 2  2013     1     1      533            529         4      850            830
# 3  2013     1     1      542            540         2      923            850
#   arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
#       <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
# 1        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
# 2        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
# 3        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
#   time_hour          
#   <dttm>             
# 1 2013-01-01 05:00:00
# 2 2013-01-01 05:00:00
# 3 2013-01-01 05:00:00
# # … with 336,773 more rows

glimpse() gives a compact overview of many variables

glimpse(cps)
Rows: 534
Columns: 11
$ wage     <dbl> 9.00, 5.50, 3.80, 10.50, 15.00, 9.00, 9.57, 15.00, 11.00, 5.0…
$ educ     <int> 10, 12, 12, 12, 12, 16, 12, 14, 8, 12, 17, 17, 14, 14, 12, 14…
$ race     <fct> W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, NW, NW, W,…
$ sex      <fct> M, M, F, F, M, F, F, M, M, F, M, M, M, M, M, M, M, M, M, M, F…
$ hispanic <fct> NH, NH, NH, NH, NH, NH, NH, NH, NH, NH, Hisp, NH, Hisp, NH, N…
$ south    <fct> NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, N…
$ married  <fct> Married, Married, Single, Married, Married, Married, Married,…
$ exper    <int> 27, 20, 4, 29, 40, 27, 5, 22, 42, 14, 18, 3, 4, 14, 35, 0, 7,…
$ union    <fct> Not, Not, Not, Not, Union, Not, Union, Not, Not, Not, Not, No…
$ age      <int> 43, 38, 22, 47, 58, 49, 23, 42, 56, 32, 41, 26, 24, 34, 53, 2…
$ sector   <fct> const, sales, sales, clerical, const, clerical, service, sale…
Tip

To view the data like an Excel spreadsheet,
click the square icon at the far right of the cps entry in the Environment pane, or run
view(cps)

summary() gives summary statistics for each variable

summary(cps)
      wage             educ       race     sex     hispanic   south   
 Min.   : 1.000   Min.   : 2.00   NW: 67   F:245   Hisp: 27   NS:378  
 1st Qu.: 5.250   1st Qu.:12.00   W :467   M:289   NH  :507   S :156  
 Median : 7.780   Median :12.00                                       
 Mean   : 9.024   Mean   :13.02                                       
 3rd Qu.:11.250   3rd Qu.:15.00                                       
 Max.   :44.500   Max.   :18.00                                       
                                                                      
    married        exper         union          age             sector   
 Married:350   Min.   : 0.00   Not  :438   Min.   :18.00   prof    :105  
 Single :184   1st Qu.: 8.00   Union: 96   1st Qu.:28.00   clerical: 97  
               Median :15.00               Median :35.00   service : 83  
               Mean   :17.82               Mean   :36.83   manuf   : 68  
               3rd Qu.:26.00               3rd Qu.:44.00   other   : 68  
               Max.   :55.00               Max.   :64.00   manag   : 55  
                                                           (Other) : 58  

count() counts the rows in each category
It also works on numeric variables: e.g., for each education level (educ) 2, 3, …, 18

cps |>  # pipe operator: alt + . (option + .)
    count(sector) |>
    print() # can be omitted
# A tibble: 8 × 2
  sector       n
  <fct>    <int>
1 clerical    97
2 const       20
3 manag       55
4 manuf       68
5 other       68
6 prof       105
7 sales       38
8 service     83
cps |>
    count(sex, married) |>
    print()
# A tibble: 4 × 3
  sex   married     n
  <fct> <fct>   <int>
1 F     Married   162
2 F     Single     83
3 M     Married   188
4 M     Single    101
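
As a small illustration of counting a numeric variable, the sketch below uses a made-up tibble (people and its values are hypothetical, not taken from CPS85). It also shows the sort argument of count() and the equivalent group_by() plus summarise() form.

```r
library(dplyr)

# Made-up data for illustration only
people <- tibble(educ = c(12, 16, 12, 14, 12, 16))

# One row per distinct value of educ, with its frequency in column n
counts <- people |> count(educ)
counts

# sort = TRUE puts the most frequent value first
sorted <- people |> count(educ, sort = TRUE)
sorted

# count() is shorthand for group_by() followed by summarise(n = n())
by_hand <- people |>
    group_by(educ) |>
    summarise(n = n())
by_hand
```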
Pipe operator

|> or %>% (read it as 'then')

x |> f(y) # same as f(x, y)
x |> f(y) |> g(z) # same as g(f(x, y), z)

summary(cps) is equivalent to

cps |>
    summary()

count(cps, sector) is equivalent to

cps |> 
    count(sector)
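
The rewriting rule can be checked with base R functions alone. In this sketch, sqrt() and round() are chosen only for illustration; the native pipe |> requires R 4.1 or later.

```r
x <- c(2, 10, 50)

# Nested form: the innermost call runs first
nested <- round(sqrt(x), 1)

# Piped form: x, then sqrt(), then round(1), i.e. round(sqrt(x), 1)
piped <- x |> sqrt() |> round(1)

identical(nested, piped)
```

The piped form lists the steps in the order they are carried out, which is why it is preferred for longer chains.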

Rows

Functions that operate on rows
filter(), arrange(), distinct()

filter()

Selects the rows that satisfy a condition

Conditional operators:
>, >=, <, <=,
== (equal to), != (not equal to)
& (and) | (or)
! (not)
%in% (is one of)

# People whose wage is 10 or more
cps |>
    filter(wage >= 10) |>
    print()
# A tibble: 184 × 11
   wage  educ race  sex   hispanic south married exper union   age sector  
  <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct>   
1  10.5    12 W     F     NH       NS    Married    29 Not      47 clerical
2  15      12 W     M     NH       NS    Married    40 Union    58 const   
3  15      14 W     M     NH       NS    Single     22 Not      42 sales   
4  11       8 W     M     NH       NS    Married    42 Not      56 manuf   
5  25.0    17 W     M     Hisp     NS    Married    18 Not      41 prof    
6  20.4    17 W     M     NH       NS    Single      3 Not      26 prof    
# … with 178 more rows
# Women (F) whose wage is 10 or more
cps |>
    filter(wage >= 10 & sex == "F") |>
    print()
# A tibble: 62 × 11
   wage  educ race  sex   hispanic south married exper union   age sector  
  <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct>   
1  10.5    12 W     F     NH       NS    Married    29 Not      47 clerical
2  11.2    17 NW    F     NH       NS    Married    32 Not      55 clerical
3  25.0    17 W     F     NH       NS    Single      5 Not      28 prof    
4  12.6    17 W     F     NH       NS    Married    13 Not      36 manag   
5  11.7    16 W     F     NH       NS    Single     42 Not      64 clerical
6  12.5    15 W     F     NH       NS    Married     6 Not      27 clerical
# … with 56 more rows
# People working in the management or professional sectors
cps |>
    filter(sector == "manag" | sector == "prof") |>
    print()
# A tibble: 160 × 11
   wage  educ race  sex   hispanic south married exper union   age sector
  <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct> 
1  25.0    17 W     M     Hisp     NS    Married    18 Not      41 prof  
2  20.4    17 W     M     NH       NS    Single      3 Not      26 prof  
3  10      16 W     M     Hisp     NS    Married     7 Union    29 manag 
4  15      16 NW    M     NH       NS    Married    26 Union    48 manag 
5  25.0    17 W     F     NH       NS    Single      5 Not      28 prof  
6  10      14 W     M     NH       NS    Married    22 Not      42 prof  
# … with 154 more rows

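%in% is a convenient way to match any of several values; it combines | and == into a single condition
In other words, it tests whether each value is included in a given set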

# A shorter way to select sectors for management or professional
cps |>
    filter(sector %in% c("manag", "prof")) |>
    print()
# A tibble: 160 × 11
   wage  educ race  sex   hispanic south married exper union   age sector
  <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct> 
1  25.0    17 W     M     Hisp     NS    Married    18 Not      41 prof  
2  20.4    17 W     M     NH       NS    Single      3 Not      26 prof  
3  10      16 W     M     Hisp     NS    Married     7 Union    29 manag 
4  15      16 NW    M     NH       NS    Married    26 Union    48 manag 
5  25.0    17 W     F     NH       NS    Single      5 Not      28 prof  
6  10      14 W     M     NH       NS    Married    22 Not      42 prof  
# … with 154 more rows
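
The equivalence between %in% and a chain of == joined by | can be checked on a plain vector. The values below are made up for illustration.

```r
# Made-up sector values for illustration
sector <- c("manag", "prof", "sales", "prof", "const")

with_or <- sector == "manag" | sector == "prof"
with_in <- sector %in% c("manag", "prof")

with_in                       # TRUE TRUE FALSE TRUE FALSE
identical(with_or, with_in)   # TRUE

# The two differ for missing values: == returns NA, %in% returns FALSE
NA == "prof"
NA %in% c("manag", "prof")
```

filter() drops the row in both cases, so the difference for missing values only matters when the logical vector is used elsewhere.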
Important

filter() does not modify the original data frame, so save the result if you want to keep using it
The same holds for all the functions that follow

prestige <- cps |>
    filter(sector %in% c("manag", "prof"))

prestige
#    wage  educ race  sex   hispanic south married exper union   age sector
#   <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct> 
# 1  25.0    17 W     M     Hisp     NS    Married    18 Not      41 prof  
# 2  20.4    17 W     M     NH       NS    Single      3 Not      26 prof  
# 3  10      16 W     M     Hisp     NS    Married     7 Union    29 manag 
# ...
Tip

Common mistakes

cps |>
    filter(sex = "F") # error: comparison needs "==", not "="
cps |>
    filter(sector == "manag" | "prof") # error: a complete condition is needed on both sides of |
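
To see why the second mistake fails, the condition can be evaluated on a plain vector outside filter(). This is a minimal base R sketch with made-up values; tryCatch() is used only so the error message can be captured without stopping the script.

```r
# Made-up sector values for illustration
sector <- c("manag", "prof", "sales")

# Incomplete condition: "prof" alone is a character string, not a logical test,
# so | cannot combine it with the logical vector on its left
result <- tryCatch(
    sector == "manag" | "prof",
    error = function(e) conditionMessage(e)
)
result   # the captured error message

# Complete conditions on both sides of |
fixed <- sector == "manag" | sector == "prof"
fixed    # TRUE TRUE FALSE
```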

arrange()

Sorts rows by the values of one or more columns

# Sort in ascending order by education (educ), then by wage
cps |>
    arrange(educ, wage) |>
    print(n = 10)
# A tibble: 534 × 11
    wage  educ race  sex   hispanic south married exper union   age sector 
   <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct>  
 1  3.75     2 W     M     Hisp     NS    Single     16 Not      24 service
 2  7        3 W     M     Hisp     S     Married    55 Not      64 manuf  
 3  6        4 W     M     NH       NS    Married    54 Not      64 service
 4 14        5 W     M     NH       S     Married    44 Not      55 const  
 5  3        6 W     F     Hisp     NS    Married    43 Union    55 manuf  
 6  4.62     6 NW    F     NH       S     Single     33 Not      45 manuf  
 7  5.75     6 W     M     NH       S     Married    45 Not      57 manuf  
 8  3.35     7 W     M     NH       S     Married    43 Not      56 manuf  
 9  4.5      7 W     M     Hisp     S     Married    14 Not      27 service
10  6        7 W     F     NH       S     Married    15 Not      28 manuf  
# … with 524 more rows

Use desc() to sort in descending order

# Sort by educ in descending order
cps |>
    arrange(desc(educ)) |>
    print(n = 10)
# A tibble: 534 × 11
    wage  educ race  sex   hispanic south married exper union   age sector
   <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct> 
 1 15       18 W     M     NH       NS    Married    12 Not      36 prof  
 2 14.0     18 W     F     NH       NS    Married    14 Not      38 manag 
 3 13.5     18 W     M     NH       NS    Married    14 Union    38 prof  
 4 20       18 W     F     NH       NS    Married    19 Not      43 manag 
 5  7       18 W     M     NH       NS    Married    33 Not      57 prof  
 6 11.2     18 W     M     NH       NS    Married    19 Not      43 prof  
 7  5.71    18 W     M     NH       NS    Married     3 Not      27 prof  
 8 18       18 W     M     NH       NS    Married    15 Not      39 prof  
 9 19       18 W     M     NH       NS    Single     13 Not      37 manag 
10 22.8     18 W     F     NH       NS    Single     37 Not      61 prof  
# … with 524 more rows

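arrange() and filter() can be combined to solve more complex problems

# Top earners among people working in high-status sectors
# Caution: the factor level is "manag", not "manage", so the first condition matches no rows
# and the output below contains only the 105 "prof" rows; use sector == "manag" to include managers
cps |>
    filter(sector == "manage" | sector == "prof") |>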
    arrange(desc(wage)) |>
    print()
# A tibble: 105 × 11
   wage  educ race  sex   hispanic south married exper union   age sector
  <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct> 
1  25.0    17 W     M     Hisp     NS    Married    18 Not      41 prof  
2  25.0    17 W     F     NH       NS    Single      5 Not      28 prof  
3  25.0    17 W     M     NH       NS    Married    31 Not      54 prof  
4  25.0    16 W     F     NH       S     Single      5 Not      27 prof  
5  23.2    17 NW    F     NH       NS    Married    25 Union    48 prof  
6  22.8    18 W     F     NH       NS    Single     37 Not      61 prof  
# … with 99 more rows

distinct()

Lists the unique combinations of values

cps |>
    distinct(sector, sex) |>
    print()
# A tibble: 15 × 2
   sector   sex  
   <fct>    <fct>
 1 const    M    
 2 sales    M    
 3 sales    F    
 4 clerical F    
 5 service  F    
 6 manuf    M    
 7 prof     M    
 8 service  M    
 9 other    M    
10 clerical M    
11 manag    M    
12 prof     F    
13 manag    F    
14 manuf    F    
15 other    F    

Columns

Functions that operate on columns
mutate(), select(), relocate(), rename()

mutate()

Computes new variables from existing columns/variables

tips <- as_tibble(reshape::tips) # tips dataset from the reshape package
tips |> print()
# A tibble: 244 x 7
  total_bill   tip sex    smoker day   time    size
       <dbl> <dbl> <fct>  <fct>  <fct> <fct>  <int>
1       17.0  1.01 Female No     Sun   Dinner     2
2       10.3  1.66 Male   No     Sun   Dinner     3
3       21.0  3.5  Male   No     Sun   Dinner     3
4       23.7  3.31 Male   No     Sun   Dinner     2
5       24.6  3.61 Female No     Sun   Dinner     4
6       25.3  4.71 Male   No     Sun   Dinner     4
# i 238 more rows
tips |>
    mutate(
        tip_pct = tip / total_bill * 100,
        tip_pct_per = tip_pct / size
    ) |>
    print()
# A tibble: 244 × 9
  total_bill   tip sex    smoker day   time    size tip_pct tip_pct_per
       <dbl> <dbl> <fct>  <fct>  <fct> <fct>  <int>   <dbl>       <dbl>
1       17.0  1.01 Female No     Sun   Dinner     2    5.94        2.97
2       10.3  1.66 Male   No     Sun   Dinner     3   16.1         5.35
3       21.0  3.5  Male   No     Sun   Dinner     3   16.7         5.55
4       23.7  3.31 Male   No     Sun   Dinner     2   14.0         6.99
5       24.6  3.61 Female No     Sun   Dinner     4   14.7         3.67
6       25.3  4.71 Male   No     Sun   Dinner     4   18.6         4.66
# … with 238 more rows

select()

Selects columns/variables

tips |>
    select(total_bill, tip, day, time) |>
    print()
# A tibble: 244 × 4
  total_bill   tip day   time  
       <dbl> <dbl> <fct> <fct> 
1       17.0  1.01 Sun   Dinner
2       10.3  1.66 Sun   Dinner
3       21.0  3.5  Sun   Dinner
4       23.7  3.31 Sun   Dinner
5       24.6  3.61 Sun   Dinner
6       25.3  4.71 Sun   Dinner
# … with 238 more rows
# Select the columns from tip through smoker, plus size
tips |>
    select(tip:smoker, size) |>  # columns can also be selected by position, e.g. select(2:4, 7)
    print()
# A tibble: 244 × 4
    tip sex    smoker  size
  <dbl> <fct>  <fct>  <int>
1  1.01 Female No         2
2  1.66 Male   No         3
3  3.5  Male   No         3
4  3.31 Male   No         2
5  3.61 Female No         4
6  4.71 Male   No         4
# … with 238 more rows
# Exclude the columns from sex through day
tips |>
    select(!sex:day) |> # !: not
    print()
# A tibble: 244 × 4
  total_bill   tip time    size
       <dbl> <dbl> <fct>  <int>
1       17.0  1.01 Dinner     2
2       10.3  1.66 Dinner     3
3       21.0  3.5  Dinner     3
4       23.7  3.31 Dinner     2
5       24.6  3.61 Dinner     4
6       25.3  4.71 Dinner     4
# … with 238 more rows
# Select only the factor-type variables, using a function
tips |>
    select(where(is.factor)) |>  # other functions: is.numeric, is.character
    print()
# A tibble: 244 × 4
  sex    smoker day   time  
  <fct>  <fct>  <fct> <fct> 
1 Female No     Sun   Dinner
2 Male   No     Sun   Dinner
3 Male   No     Sun   Dinner
4 Male   No     Sun   Dinner
5 Female No     Sun   Dinner
6 Male   No     Sun   Dinner
# … with 238 more rows

See the help page ?select for the many ways select() can pick columns
For example, starts_with("abc") selects the columns whose names start with abc
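
A few of these selection helpers are sketched below on a made-up tibble. The names scores, test_math, test_read and group are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Made-up data for illustration only
scores <- tibble(
    id = 1:3,
    test_math = c(90, 85, 70),
    test_read = c(80, 95, 75),
    group = c("a", "b", "a")
)

# Columns whose names start with "test"
by_prefix <- scores |> select(starts_with("test"))
names(by_prefix)

# id plus the columns whose names end with "read"
by_suffix <- scores |> select(id, ends_with("read"))
names(by_suffix)

# Columns whose names contain "_"
by_match <- scores |> select(contains("_"))
names(by_match)
```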

Note

For comparison, selecting rows and columns in base R:

cps[2:5, c("wage", "married")] # rows 2 to 5 and the wage and married columns
# # A tibble: 4 × 2
#    wage married
#   <dbl> <fct>  
# 1   5.5 Married
# 2   3.8 Single 
# 3  10.5 Married
# 4  15   Married

cps |> 
    select(wage, married) |> 
    slice(2:5) # select rows by position

relocate()

Changes the order of columns

tips |> print(n = 2)
# A tibble: 244 x 7
  total_bill   tip sex    smoker day   time    size
       <dbl> <dbl> <fct>  <fct>  <fct> <fct>  <int>
1       17.0  1.01 Female No     Sun   Dinner     2
2       10.3  1.66 Male   No     Sun   Dinner     3
# i 242 more rows
tips |> 
    relocate(day, time) |>  # move day and time to the front
    print(n = 2)
# A tibble: 244 x 7
  day   time   total_bill   tip sex    smoker  size
  <fct> <fct>       <dbl> <dbl> <fct>  <fct>  <int>
1 Sun   Dinner       17.0  1.01 Female No         2
2 Sun   Dinner       10.3  1.66 Male   No         3
# i 242 more rows
tips |> 
    relocate(sex:time, tip) |>  # move sex through time, followed by tip, to the front
    print(n = 2)
# A tibble: 244 x 7
  sex    smoker day   time     tip total_bill  size
  <fct>  <fct>  <fct> <fct>  <dbl>      <dbl> <int>
1 Female No     Sun   Dinner  1.01       17.0     2
2 Male   No     Sun   Dinner  1.66       10.3     3
# i 242 more rows
tips |> 
    relocate(day:size, .after = tip) |>   # .before: place before a column, .after: place after it
    print(n = 2)
# A tibble: 244 x 7
  total_bill   tip day   time    size sex    smoker
       <dbl> <dbl> <fct> <fct>  <int> <fct>  <fct> 
1       17.0  1.01 Sun   Dinner     2 Female No    
2       10.3  1.66 Sun   Dinner     3 Male   No    
# i 242 more rows
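
The examples above use .after only. The sketch below shows .before as well, on a made-up one-row tibble (bills is hypothetical), and combines relocate() with the selection helpers where() and last_col().

```r
library(dplyr)

# Made-up data for illustration only
bills <- tibble(total_bill = 17, tip = 1, day = "Sun", size = 2)

# Move size so that it sits just before tip
before_tip <- bills |> relocate(size, .before = tip)
names(before_tip)

# Move every character column to the end
chars_last <- bills |> relocate(where(is.character), .after = last_col())
names(chars_last)
```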

rename()

Renames columns

cps |>
    rename(education = educ, marital = married) |> # new = old
    print()
# A tibble: 534 × 11
   wage education race  sex   hispanic south marital exper union   age sector  
  <dbl>     <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct>   
1   9          10 W     M     NH       NS    Married    27 Not      43 const   
2   5.5        12 W     M     NH       NS    Married    20 Not      38 sales   
3   3.8        12 W     F     NH       NS    Single      4 Not      22 sales   
4  10.5        12 W     F     NH       NS    Married    29 Not      47 clerical
5  15          12 W     M     NH       NS    Married    40 Union    58 const   
6   9          16 W     F     NH       NS    Married    27 Not      49 clerical
# … with 528 more rows

Variables can also be renamed while they are being selected

cps |>
    select(education = educ, marital = married) |> # new = old
    print()
# A tibble: 534 × 2
  education marital
      <int> <fct>  
1        10 Married
2        12 Married
3        12 Single 
4        12 Married
5        12 Married
6        16 Married
# … with 528 more rows

Groups

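Analyses often split the data by category and compute statistics for each category;
group_by() and summarise() are most often used together for this

group_by()

Splits the dataset into groups that are meaningful for the analysis

Below, the dataset is grouped by sex. The data themselves are not modified; the grouping is only recorded internally.
The line Groups: sex [2] near the top of the output marks it as a grouped data frame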

cps |>
    group_by(sex) |> 
    print()
# A tibble: 534 × 11
# Groups:   sex [2]
   wage  educ race  sex   hispanic south married exper union   age sector  
  <dbl> <int> <fct> <fct> <fct>    <fct> <fct>   <int> <fct> <int> <fct>   
1   9      10 W     M     NH       NS    Married    27 Not      43 const   
2   5.5    12 W     M     NH       NS    Married    20 Not      38 sales   
3   3.8    12 W     F     NH       NS    Single      4 Not      22 sales   
4  10.5    12 W     F     NH       NS    Married    29 Not      47 clerical
5  15      12 W     M     NH       NS    Married    40 Union    58 const   
6   9      16 W     F     NH       NS    Married    27 Not      49 clerical
# … with 528 more rows

summarise()

Same as summarize()
Computes statistics for each group and returns one row per group

# Average wage by sex
cps |>
    group_by(sex) |>
    summarise(
        avg_wage = mean(wage, na.rm = TRUE),  # mean(): average; na.rm = TRUE removes NAs first
        n = n()  # n(): number of rows in the group
    ) |>
    print()
# A tibble: 2 × 3
  sex   avg_wage     n
  <fct>    <dbl> <int>
1 F         7.88   245
2 M         9.99   289

You can group by two or more variables

cps |>
    group_by(sex, married) |>
    summarize(
        ave_wage = mean(wage),
        sd_wage = sd(wage)) |>
    print()
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 4 × 4
# Groups:   sex [2]
  sex   married ave_wage sd_wage
  <fct> <fct>      <dbl>   <dbl>
1 F     Married     7.68    3.73
2 F     Single      8.26    6.23
3 M     Married    10.9     5.35
4 M     Single      8.35    4.78

Note that the resulting data frame is still grouped by sex.
It remains a grouped data frame until the grouping is removed with ungroup().
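
The effect of the leftover grouping can be seen in a small sketch. The tibble pay and its values are made up for illustration; group_vars() reports the current grouping variables, and .groups is the summarise() argument mentioned in the message above.

```r
library(dplyr)

# Made-up data for illustration only
pay <- tibble(
    sex = c("F", "F", "M", "M"),
    married = c("Married", "Single", "Married", "Single"),
    wage = c(8, 9, 11, 8)
)

# "drop_last" is the default behaviour here, stated explicitly to silence the message
by_two <- pay |>
    group_by(sex, married) |>
    summarise(ave_wage = mean(wage), .groups = "drop_last")

group_vars(by_two)            # "sex": the result is still grouped
group_vars(ungroup(by_two))   # character(0): grouping removed

# A further summarise() respects the leftover grouping
still_grouped <- by_two |> summarise(m = mean(ave_wage))             # one row per sex
ungrouped <- by_two |> ungroup() |> summarise(m = mean(ave_wage))    # a single row

# Alternatively, drop all grouping inside summarise()
dropped <- pay |>
    group_by(sex, married) |>
    summarise(ave_wage = mean(wage), .groups = "drop")
group_vars(dropped)
```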


Useful summary functions
For details, see R for Data Science/Data transformation

  • Measures of location: mean(), median()
  • Measures of spread: sd(), IQR(), mad()
  • Measures of rank: min(), max(), quantile(x, 0.25)
  • Measures of position: min_rank(), first(), nth(x, 2), last()
  • Measures of count: n(), sum(!is.na(x)), n_distinct()
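
Several of these functions can be combined in one summarise() call. The sketch below uses a made-up tibble (wages and its values are hypothetical) rather than the CPS85 data.

```r
library(dplyr)

# Made-up data for illustration only
wages <- tibble(
    sector = c("a", "a", "a", "b", "b"),
    wage = c(5, 7, 12, 10, 20)
)

wage_summary <- wages |>
    group_by(sector) |>
    summarise(
        med = median(wage),          # location
        spread = IQR(wage),          # spread
        top = max(wage),             # rank
        second = nth(wage, 2),       # position: second value within the group
        n = n(),                     # count of rows
        n_unique = n_distinct(wage)  # count of distinct values
    )
wage_summary
```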

Missing

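In R, missing values are denoted NA
NaN (not a number) mostly arises from computations such as 0 / 0 (dividing a nonzero number by zero gives Inf or -Inf instead). is.na() treats NaN as missing, so it rarely needs separate handling. For details, see R for Data Science/Missing values

NA has the following properties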

NA > 5
#> [1] NA
10 == NA
#> [1] NA
NA + 10
#> [1] NA
NA / 2
#> [1] NA
NA == NA
#> [1] NA

x <- NA
is.na(x)
#> [1] TRUE
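
The relationship between NaN and NA can be checked directly in base R. This short sketch uses arbitrary numbers for illustration.

```r
0 / 0        # NaN
1 / 0        # Inf, not NaN

is.na(NaN)   # TRUE: NaN counts as missing
is.nan(NA)   # FALSE: NA is not NaN

# na.rm = TRUE removes NaN as well as NA
avg <- mean(c(1, NaN, 3, NA), na.rm = TRUE)
avg          # 2
```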

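filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped along with the FALSE ones

  • The condition evaluates to TRUE, FALSE, or NA for each row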
df <- tibble(
        one = c(1, NA, 3, 4, 2, NA), 
        two = c(2, 5, 3, NA, 10, NA), 
        three = c("a", "a", "a", "a", "b", "b")
    )
df
#     one   two three
#   <dbl> <dbl> <chr>
# 1     1     2 a    
# 2    NA     5 a    
# 3     3     3 a    
# 4     4    NA a    
# 5     2    10 b    
# 6    NA    NA b    

filter(df, one > 1)
#     one   two three
#   <dbl> <dbl> <chr>
# 1     3     3 a    
# 2     4    NA a    
# 3     2    10 b
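
For contrast, base R subsetting with a logical vector does not drop the NA positions. The sketch below uses a plain vector with the same values as the column one; which() is one base R way to keep only the TRUE positions.

```r
one <- c(1, NA, 3, 4, 2, NA)   # same values as the column `one`

cond <- one > 1
cond                           # FALSE NA TRUE TRUE TRUE NA

# Logical subsetting returns NA wherever the condition is NA
kept_base <- one[cond]
kept_base                      # NA 3 4 2 NA

# which() returns only the TRUE positions, matching what filter() keeps
kept_true <- one[which(cond)]
kept_true                      # 3 4 2
```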

# To keep the rows with NA as well,
filter(df, one > 1 | is.na(one))
#     one   two three
#   <dbl> <dbl> <chr>
# 1    NA     5 a    
# 2     3     3 a    
# 3     4    NA a    
# 4     2    10 b    
# 5    NA    NA b

# Only the rows without NA
filter(df, !is.na(one))
filter(df, !is.na(one) & !is.na(two)) # only rows with no NA in either one or two

na.omit(df) # drops every row containing at least one NA; usually used after missing values have been carefully imputed
#     one   two three
#   <dbl> <dbl> <chr>
# 1     1     2 a    
# 2     3     3 a    
# 3     2    10 b 

# Many functions can handle NA directly
mean(df$one)
## [1] NA

mean(df$one, na.rm = TRUE) # NA removed
## [1] 2.5

To find out how many values went into a result computed with na.rm = TRUE,

df |> 
    group_by(three) |> 
    summarise(
        ave = mean(two, na.rm = TRUE), 
        n = n(), 
        n_notna = sum(!is.na(two))  # TRUE counts as 1 and FALSE as 0
    )
#   three   ave     n n_notna
#   <chr> <dbl> <int>   <int>
# 1 a      3.33     4       3
# 2 b     10        2       1

Summary

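The following basic verbs of the dplyr package transform the data and compute the statistics you need

  • Keep only the rows (observations) that satisfy a condition: filter()
  • Reorder the rows: arrange()
  • Select variables: select()
  • Create new variables from existing variables and functions: mutate()
  • Condense the data into the summary statistics you want: summarise()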