Grouping

This section shows how to compute per-group statistics in pandas.

import pandas as pd

df = pd.read_excel('census.xlsx')

Creating groups

The income variable takes two values, >50K and <=50K.

df['income'].unique()

array(['<=50K', '>50K'], dtype=object)

Let's group the data by income.

gr = df.groupby('income')

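Because there are two distinct values, the rows are split into two groups. The groups attribute maps each group label to the row labels that belong to it.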

gr.groups

{'<=50K': Int64Index([    0,     1,     2,     3,     4,     5,     6,    12,    13,
                15,
             ...
             32548, 32549, 32550, 32551, 32552, 32553, 32555, 32556, 32558,
             32559],
            dtype='int64', length=24720),
 '>50K': Int64Index([    7,     8,     9,    10,    11,    14,    19,    20,    25,
                27,
             ...
             32530, 32532, 32533, 32536, 32538, 32539, 32545, 32554, 32557,
             32560],
            dtype='int64', length=7841)}
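
The same split is easier to inspect on a frame small enough to check by hand. The sketch below uses a made-up five-row frame (the name toy and its values are hypothetical, not taken from census.xlsx) and reads the group count, the group membership, and the group sizes.

```python
import pandas as pd

# Hypothetical toy data, not the census file used in this section.
toy = pd.DataFrame({
    'income': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
    'education_num': [9, 13, 10, 14, 8],
})

toy_gr = toy.groupby('income')

# One group per distinct value of income.
n_groups = toy_gr.ngroups

# Map each group label to the row labels it contains, as plain lists.
members = {label: list(index) for label, index in toy_gr.groups.items()}

# Number of rows in each group.
sizes = toy_gr.size().to_dict()

print(n_groups)
print(members)
print(sizes)
```

Here size() plays the role of the length=... field shown in the output above.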

Let's look at only the rows in the >50K group.

gr.get_group('>50K').head()

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
7	52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	45	United-States	>50K
8	31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084	50	United-States	>50K
9	42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178	40	United-States	>50K
10	37	Private	280464	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	Black	Male	0	80	United-States	>50K
11	30	State-gov	141297	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	40	India	>50K
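
As a rough cross-check, get_group should select the same rows as an ordinary boolean filter on the grouping column. The sketch below compares the two on a hypothetical toy frame (toy and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data, not the census file used in this section.
toy = pd.DataFrame({
    'income': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
    'age': [25, 52, 38, 31, 44],
})

# Rows of the '>50K' group, taken from the GroupBy object.
high = toy.groupby('income').get_group('>50K')

# The same rows, selected with a boolean mask.
masked = toy[toy['income'] == '>50K']

# Both selections keep the original row labels.
same_rows = high.index.equals(masked.index)

print(high)
print(same_rows)
```

get_group is convenient when the GroupBy object already exists. For a one-off selection the boolean mask is usually enough.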

Per-group statistics

Let's look at the mean of the education_num variable in each of the two groups.

gr['education_num'].mean()

income
<=50K     9.595065
>50K     11.611657
Name: education_num, dtype: float64
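
To make clear what this mean is, the sketch below repeats the call on a hypothetical toy frame (values made up for illustration) where the averages can be worked out by hand, and compares one of them with a manual filter-then-average.

```python
import pandas as pd

# Hypothetical toy data, not the census file used in this section.
toy = pd.DataFrame({
    'income': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
    'education_num': [9, 13, 10, 14, 8],
})

# Per-group mean, indexed by the group label.
means = toy.groupby('income')['education_num'].mean()
# <=50K: (9 + 10 + 8) / 3 = 9.0
# >50K:  (13 + 14) / 2   = 13.5

# The same number for one group, computed without groupby.
manual_high = toy.loc[toy['income'] == '>50K', 'education_num'].mean()

print(means)
print(manual_high)
```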

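The agg method lets you compute a wider variety of statistics. The code below first computes the same education_num mean as above. Common statistics such as the mean (mean), sum (sum), maximum (max), minimum (min), variance (var), and standard deviation (std) can be passed as strings, and the corresponding function is applied automatically.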

gr.agg({'education_num': 'mean'})

	education_num
income
<=50K	9.595065
>50K	11.611657

To compute several statistics at once, pass them as a list, as shown below.

gr.agg({'education_num': ['mean', 'std']})

	education_num
	mean	std
income
<=50K	9.595065	2.436147
>50K	11.611657	2.385129
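
The other string names work the same way. The sketch below passes several of them in one list on a hypothetical toy frame (values made up for illustration). Note that var and std in pandas use the sample formula (ddof=1) by default.

```python
import pandas as pd

# Hypothetical toy data, not the census file used in this section.
toy = pd.DataFrame({
    'income': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
    'education_num': [9, 13, 10, 14, 8],
})

stats = toy.groupby('income').agg(
    {'education_num': ['min', 'max', 'sum', 'var', 'std']}
)
# <=50K: min 8,  max 10, sum 27, var 1.0, std 1.0
# >50K:  min 13, max 14, sum 27, var 0.5, std about 0.707

print(stats)
```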

Statistics for several columns can also be computed in a single call.

gr.agg(
    {
        'education_num': ['mean', 'std'],
        'capital_gain': 'mean'
    }
)

	education_num		capital_gain
	mean	std	mean
income
<=50K	9.595065	2.436147	148.752468
>50K	11.611657	2.385129	4006.142456
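
The result above has two header rows because its columns form a MultiIndex of (column, statistic) pairs. The sketch below, on a hypothetical toy frame (values made up for illustration), shows one way to pick a single statistic and one possible way to flatten the header into plain names. The underscore naming scheme is just an assumption for this example.

```python
import pandas as pd

# Hypothetical toy data, not the census file used in this section.
toy = pd.DataFrame({
    'income': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
    'education_num': [9, 13, 10, 14, 8],
    'capital_gain': [0, 5000, 0, 15000, 300],
})

summary = toy.groupby('income').agg(
    {
        'education_num': ['mean', 'std'],
        'capital_gain': 'mean',
    }
)

# Select one statistic with a (column, statistic) tuple.
edu_mean = summary[('education_num', 'mean')]

# Flatten the two-level header into single strings.
flat = summary.copy()
flat.columns = ['_'.join(col) for col in summary.columns]

print(edu_mean)
print(flat.columns.tolist())
```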

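Instead of passing the string 'mean', you can pass a function, and that function is applied to each group. Note that recent pandas releases emit a FutureWarning when a NumPy reduction such as np.mean is passed this way and recommend the string 'mean' instead.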

import numpy as np

gr.agg({'education_num': np.mean})

	education_num
income
<=50K	9.595065
>50K	11.611657
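
Passing a function is most useful for statistics that have no string name. As a minimal sketch, the hypothetical helper value_range below (not part of pandas) computes the spread between the largest and smallest value in each group of a made-up toy frame.

```python
import pandas as pd

# Hypothetical toy data, not the census file used in this section.
toy = pd.DataFrame({
    'income': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
    'education_num': [9, 13, 10, 14, 8],
})


def value_range(values):
    """Return the largest value minus the smallest value of one group."""
    return values.max() - values.min()


# agg calls value_range once per group with that group's column values.
ranges = toy.groupby('income').agg({'education_num': value_range})
# <=50K: 10 - 8 = 2
# >50K:  14 - 13 = 1

print(ranges)
```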
