[R Programming] Data Wrangling - Summarizing Data by Group: group_by(), summarise() (dplyr)

#group_by()

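#Grouping data

#You can group the rows of a data set by the values of a column you specify.

#Group dataSample by class (the console output below uses the name exam for what appears to be the same data)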

dataSample %>% group_by(class)

> exam %>% group_by(class)

# A tibble: 20 x 5
# Groups:   class [5]
     id class  math english science
1     1     1    50      98      50
2     2     1    60      97      60
3     3     1    45      86      78
4     4     1    30      98      58
5     5     2    25      80      65
...

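-> A tibble: 20 x 5 : shows that the data has 20 rows and 5 columns.

-> Groups: class [5] : five groups are created, one per class. The printed rows are not visually split by group, but the grouping is stored and is used by later steps such as summarise().

group_by() returns its result as a tibble.

A tibble is a data frame with some extra features added.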
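The grouping metadata described above can be inspected directly. The following is a minimal sketch, assuming dplyr is installed; `scores` is a hypothetical stand-in for the post's data, with made-up values.

```r
library(dplyr)

# Hypothetical stand-in data (not the post's actual exam data)
scores <- data.frame(
  id    = 1:6,
  class = c(1, 1, 2, 2, 3, 3),
  math  = c(50, 60, 45, 30, 25, 50)
)

grouped <- scores %>% group_by(class)

class(grouped)        # includes "grouped_df", "tbl_df" and "data.frame"
group_vars(grouped)   # name of the grouping column
n_groups(grouped)     # number of groups
nrow(grouped)         # grouping does not change the number of rows
```

Grouping only attaches metadata; the rows themselves stay as they were until a function such as summarise() makes use of the groups.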

#summarise()

#Show summary statistics

dataSample %>% summarise(mean(math))

> dataSample %>% summarise(mean(math))

mean(math)

1 57.45

#If you name the summary inside summarise(), the summarised value is shown under that name.

dataSample %>% summarise(mean_math=mean(math))

> dataSample %>% summarise(mean_math=mean(math))

mean_math

1 57.45

#Group the data and compute the mean of a column for each group

#Group by class and show the mean math score of each class

dataSample %>%

group_by(class) %>%

summarise(mean_math = mean(math))

> dataSample %>%

+ group_by(class) %>%

+ summarise(mean_math = mean(math))

# A tibble: 5 x 2
  class mean_math
1     1      46.2
2     2      61.2
3     3      45
4     4      56.8
5     5      78
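The per-group mean above can be reproduced on a small self-contained data set. This is a sketch only: `scores` and its values are hypothetical and chosen so the means are easy to check by hand.

```r
library(dplyr)

# Hypothetical data: two classes with two students each
scores <- data.frame(
  class = c(1, 1, 2, 2),
  math  = c(40, 60, 70, 90)
)

# One output row per class, with the mean stored under the name mean_math
by_class <- scores %>%
  group_by(class) %>%
  summarise(mean_math = mean(math))

by_class
```

Class 1 averages (40 + 60) / 2 = 50 and class 2 averages (70 + 90) / 2 = 80, so the result has two rows.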

#group_by(): groups within groups

#List several columns separated by commas to create subgroups inside each group.

#Within each class, split the students by whether science_test is pass or fail

exam_new %>%

group_by(class, science_test) %>%

summarise(mean(science),median(science),n())

> #Within each class, split the students by science_test (pass or fail) and compute the mean and median of science

> exam_new %>%

+ group_by(class, science_test) %>%

+ summarise(mean(science),median(science),n())

# A tibble: 10 x 5
# Groups:   class [?]
   class science_test `mean(science)` `median(science)` `n()`
 1     1 fail                    56                58       3
 2     1 pass                    78                78       1
 3     2 fail                    35                35       2
 4     2 pass                    81.5              81.5     2
 5     3 fail                    30.7              32       3
 6     3 pass                    65                65       1
 7     4 fail                    12                12       1
 8     4 pass                    69.3              65       3
 9     5 fail                    58                58       1
10     5 pass                    91.7              90       3
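The column names in the output above, such as `mean(science)`, are awkward to reuse. Naming each summary avoids that. The sketch below makes several assumptions: the data is hypothetical, the pass mark of 65 is invented for illustration (the post does not say how science_test was built), and `.groups = "drop"` requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical data (not the post's exam_new)
scores <- data.frame(
  class   = c(1, 1, 1, 2, 2, 2),
  science = c(50, 60, 80, 30, 70, 90)
)

# Assumed rule for the pass/fail column: 65 or above passes
scores_new <- scores %>%
  mutate(science_test = ifelse(science >= 65, "pass", "fail"))

by_class_test <- scores_new %>%
  group_by(class, science_test) %>%     # subgroup: pass/fail inside each class
  summarise(
    mean_science   = mean(science),
    median_science = median(science),
    n_students     = n(),
    .groups = "drop"                    # return an ungrouped result
  )

by_class_test
```

With this data there are four class and science_test combinations, so the result has four rows.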

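#Summary statistic functions for summarise()

#Put the column to summarise inside the parentheses

mean() mean

sd() standard deviation

sum() sum

median() median

min() minimum

max() maximum

#Other functions you can use in summarise()

n() counts the number of rows; nothing goes inside the parentheses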
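The functions listed above can all be combined in one summarise() call. A minimal sketch, using a hypothetical three-value column so each result can be checked by hand:

```r
library(dplyr)

# Hypothetical data: three math scores
scores <- data.frame(math = c(10, 20, 30))

stats <- scores %>%
  summarise(
    mean_math   = mean(math),    # 20
    sd_math     = sd(math),      # 10 (sample standard deviation)
    sum_math    = sum(math),     # 60
    median_math = median(math),  # 20
    min_math    = min(math),     # 10
    max_math    = max(math),     # 30
    n_rows      = n()            # 3
  )

stats
```

Note that sd() computes the sample standard deviation (dividing by n - 1), which is why three evenly spaced values 10 apart give exactly 10.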

summary() and summarise()

#summarise() has to be told which summary function to apply; called with no summary expression, it returns an empty result

> summarise(exam)

data frame with 0 columns and 0 rows

> summary(exam)

       id            class        math          english        science
 Min.   : 1.00   Min.   :1   Min.   :20.00   Min.   :56.0   Min.   :12.00
 1st Qu.: 5.75   1st Qu.:2   1st Qu.:45.75   1st Qu.:78.0   1st Qu.:45.00
 Median :10.50   Median :3   Median :54.00   Median :86.5   Median :62.50
 Mean   :10.50   Mean   :3   Mean   :57.45   Mean   :84.9   Mean   :59.45
 3rd Qu.:15.25   3rd Qu.:4   3rd Qu.:75.75   3rd Qu.:98.0   3rd Qu.:78.00
 Max.   :20.00   Max.   :5   Max.   :90.00   Max.   :98.0   Max.   :98.00

#With summarise() you can view the statistics split by group, as follows.

> #Group by class and find the mean, sum, median, minimum and maximum of the math scores, plus the number of rows in each class.

> exam %>%

+ group_by(class) %>%

+ summarise(mean_math = mean(math),

+ sum_math = sum(math),

+ median_math = median(math),

+ min_math = min(math),

+ max_math = max(math),

+ rowCount_math = n()

+ )

# A tibble: 5 x 7
  class mean_math sum_math median_math min_math max_math rowCount_math
1     1      46.2      185        47.5       30       60             4
2     2      61.2      245        65         25       90             4
3     3      45        180        47.5       20       65             4
4     4      56.8      227        53         46       75             4
5     5      78        312        79         65       89             4

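#summary() simply computes statistics for each column over the whole table as it currently stands.

#In other words, it shows the statistics for math as a whole, but even if you split the data by class with group_by(), it does not show the math mean per class.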

> summary(exam %>% group_by(class))

       id            class        math          english        science
 Min.   : 1.00   Min.   :1   Min.   :20.00   Min.   :56.0   Min.   :12.00
 1st Qu.: 5.75   1st Qu.:2   1st Qu.:45.75   1st Qu.:78.0   1st Qu.:45.00
 Median :10.50   Median :3   Median :54.00   Median :86.5   Median :62.50
 Mean   :10.50   Mean   :3   Mean   :57.45   Mean   :84.9   Mean   :59.45
 3rd Qu.:15.25   3rd Qu.:4   3rd Qu.:75.75   3rd Qu.:98.0   3rd Qu.:78.00
 Max.   :20.00   Max.   :5   Max.   :90.00   Max.   :98.0   Max.   :98.00
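The contrast between the two functions can be seen side by side on a small example. This is a sketch with hypothetical data: base R's summary() describes each column as a whole, while dplyr's summarise() returns one row per group.

```r
library(dplyr)

# Hypothetical data: two classes with two students each
scores <- data.frame(
  class = c(1, 1, 2, 2),
  math  = c(40, 60, 70, 90)
)

grouped <- scores %>% group_by(class)

# Base R: one set of statistics per column, computed over all rows
summary(grouped)

# dplyr: one row per class
per_class <- grouped %>%
  summarise(mean_math = mean(math), max_math = max(math))

per_class
```

When statistics per group are needed, summarise() after group_by() is the tool to reach for; summary() is more useful as a quick overview of the whole table.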

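#Combining the functions of the dplyr package

#For the students who passed the science test, compute each student's average over all subjects (math, english, science), take the mean of those averages per class, sort in descending order, and print the top 3 classes.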

exam_new %>%

group_by(class) %>%

filter(science_test == "pass") %>%

mutate(total = (math+english+science)/3) %>%

summarise(mean_total = mean(total)) %>%

arrange(desc(mean_total)) %>%

head(3)

# A tibble: 3 x 2
  class mean_total
1     5       80.3
2     4       71
3     1       69.7
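The same chain of steps can be tried on a small self-contained data set. This is a sketch under stated assumptions: `scores_new` and all its values are hypothetical, the per-student column is named `avg_score` here because it holds an average rather than a total, and filter() is placed before group_by(), which gives the same result for a row-wise condition like this one.

```r
library(dplyr)

# Hypothetical data (not the post's exam_new)
scores_new <- data.frame(
  class        = c(1, 1, 2, 2, 3, 3),
  math         = c(60, 30, 90, 80, 50, 70),
  english      = c(90, 60, 90, 70, 80, 60),
  science      = c(90, 30, 90, 90, 20, 80),
  science_test = c("pass", "fail", "pass", "pass", "fail", "pass")
)

top_classes <- scores_new %>%
  filter(science_test == "pass") %>%                       # keep passing students
  mutate(avg_score = (math + english + science) / 3) %>%   # per-student average
  group_by(class) %>%
  summarise(mean_avg = mean(avg_score)) %>%                # per-class mean
  arrange(desc(mean_avg)) %>%                              # highest first
  head(2)                                                  # top 2 classes

top_classes
```

Here class 2 averages 85, class 1 averages 80 and class 3 averages 70, so the top two rows are classes 2 and 1.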

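#Find out which server has the most players with the "wizard" job (mpg here is a game data set with server and class columns, not the mpg data set that ships with ggplot2)

#Group the rows by server.

#Print the number of "wizard" players per server, sorted in descending order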

mpg %>%

group_by(server) %>%

filter (class == "wizard") %>%

summarise(wizard_count = n()) %>%

arrange(desc(wizard_count))

# A tibble: 5 x 2
  server wizard_count
1 korea          1524
2 us             1423
3 japan          1231
4 china           423
5 taiwan          231
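The counting pattern above can be sketched on a tiny hypothetical data set. `players` and its values are invented for illustration. The second pipeline uses dplyr's count(), which is a shorthand for group_by() followed by summarise(n()).

```r
library(dplyr)

# Hypothetical data: one row per player
players <- data.frame(
  server = c("korea", "korea", "us", "us", "us", "japan"),
  class  = c("wizard", "warrior", "wizard", "wizard", "archer", "wizard")
)

# Long form, mirroring the steps in the post
wizards <- players %>%
  filter(class == "wizard") %>%
  group_by(server) %>%
  summarise(wizard_count = n()) %>%
  arrange(desc(wizard_count))

# Shorter equivalent using count()
wizards_short <- players %>%
  filter(class == "wizard") %>%
  count(server, name = "wizard_count", sort = TRUE)

wizards
wizards_short
```

Both forms put the us server first with 2 wizards; the long form makes each step explicit, while count() is convenient once the pattern is familiar.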
