groupby

Published by onesixx on 22-04-27

Groupby => used as the step before Apply: split the data into groups, then apply a function to each group

https://morningcoding.tistory.com/entry/Python-62-pandas-agg-vs-apply

import pandas as pd
import numpy as np
import plotly.express as px

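# Print rows without truncation
pd.get_option('display.max_rows')        # check the current limit
pd.set_option('display.max_rows', None)  # None = no limit, every row is shown
pd.set_option('display.max_rows', 60)    # back to the default of 60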

###  Load data ------------------------------
iris = px.data.iris()
iris['species'].unique()
# array(['setosa', 'versicolor', 'virginica'], dtype=object)

###  Groupby => used as the step before Apply ------------------------------
grouped = iris.groupby(['species'])

###  Groupby object: iterating yields (group name tuple :: dataframe) ------------------------------
for groupName, data in grouped:
    print(f" {groupName} :: {data.shape}")
#  ('setosa',) :: (50, 6)
#  ('versicolor',) :: (50, 6)
#  ('virginica',) :: (50, 6)

grouped.size()
# species
# setosa        50
# versicolor    50
# virginica     50
# dtype: int64
grouped.count()
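
size() and count() give the same numbers on iris only because it has no missing values. A small sketch on hypothetical toy data (one NaN added on purpose) shows where they differ:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" has one missing measurement
toy = pd.DataFrame({
    "species": ["a", "a", "b"],
    "sepal_length": [5.1, np.nan, 6.3],
})
toy_grouped = toy.groupby("species")

sizes = toy_grouped.size()    # number of rows per group, NaN rows included
counts = toy_grouped.count()  # number of non-null values per column, per group

print(sizes)
print(counts)
```

size() returns a Series with one number per group, while count() returns a DataFrame with one number per group and column.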

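grouped.groups               # {group name : [index labels], ...}
grouped.groups.keys()        # group names, available as the dict keys
grouped.get_group('setosa')  # data of one group (with a list key, newer pandas may expect the 1-tuple ('setosa',))

grouped.first()              # first row of each group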

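### Operations ------------------------------
def uf_minus(x):
    return x - 1

# numeric columns only: subtracting 1 from the string column 'species' would fail
num_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
grouped[num_cols].apply(uf_minus)                     # the function receives each group's DataFrame
grouped['sepal_length'].apply(uf_minus)               # the function receives each group's Series
grouped.apply(lambda x: uf_minus(x['sepal_length']))  # select the column inside the function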

# transform returns a result with the same index as the original DataFrame
grouped.transform(uf_minus)
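
Because transform keeps the original index, its result can be combined directly with the original columns. The following is a minimal sketch on a hypothetical toy frame (the data and the `centered` column name are made up for illustration) that subtracts each group's mean from its rows:

```python
import pandas as pd

# Hypothetical toy data, not the iris dataset
toy = pd.DataFrame({
    "species": ["a", "b", "a", "b"],
    "sepal_length": [1.0, 10.0, 3.0, 30.0],
})

# transform returns one value per input row, aligned with toy.index
group_mean = toy.groupby("species")["sepal_length"].transform("mean")

# Row-aligned, so it can be subtracted from the original column directly
toy["centered"] = toy["sepal_length"] - group_mean
print(toy)
```

An aggregation such as `mean` would normally collapse each group to one row; passing it through transform broadcasts the group value back to every row of that group.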

# === aggregation
# per-group mean over all remaining columns
grouped.agg('mean')       # same as grouped.mean()

# aggregation on selected columns: several statistics per column, per group
grouped[['sepal_length', 'sepal_width']].agg(['mean', 'sum'])

grouped.agg({
    'sepal_length':'mean',
    'sepal_width':'mean'
})
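
If the result columns should get their own names, pandas also supports named aggregation, where each keyword argument is an output name and its value is a (column, function) pair. A minimal sketch on hypothetical toy data (the output names `length_mean` and `width_max` are made up):

```python
import pandas as pd

# Hypothetical toy data, not the iris dataset
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "sepal_length": [1.0, 3.0, 10.0, 30.0],
    "sepal_width": [2.0, 4.0, 6.0, 8.0],
})

# keyword = output column name, value = (input column, aggregation)
summary = toy.groupby("species").agg(
    length_mean=("sepal_length", "mean"),
    width_max=("sepal_width", "max"),
)
print(summary)
```

Unlike the list form, this produces flat column names rather than a two-level column index.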

apply vs agg

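apply is the broadest concept.

agg is an apply used under one specific condition.

# agg is the special case of apply in which a <summary-statistic aggregation> function is applied and a scalar is returned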

iris.iloc[:,:4].apply('mean')
iris.iloc[:,:4].apply('mean', axis=1) # row-wise aggregation

iris[['sepal_length', 'sepal_width']].apply(['mean', 'sum'])

iris.apply({
    'sepal_length':['mean', 'sum'],
    'sepal_width':['mean', 'sum']
})
# Pass the function as a lambda; the default is axis=0, so x is a column
# For longer logic, define a named function instead of a lambda
iris.iloc[:,:4].apply(
    lambda x: x.mean()
)

def get_diff(row):
    return row.sepal_length - row.petal_length
iris.apply(get_diff, axis=1)

iris.apply(lambda row: row['sepal_length'] - row['petal_length'], axis=1)
# iris.apply(lambda row: row.sepal_length - row.petal_length, axis=1)
dd = iris.copy()
dd['length_diff'] = dd.apply(lambda row: row['sepal_length'] - row['petal_length'], axis=1)

iris[
    iris.apply(lambda row: row['sepal_length'] > 7.6, axis=1)
]
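
For simple arithmetic and comparisons like the row-wise examples above, plain column expressions give the same result without calling a Python function per row. A small sketch on hypothetical toy data, assuming the logic involves only element-wise operations:

```python
import pandas as pd

# Hypothetical toy data with the same column names as iris
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.7, 6.0],
    "petal_length": [1.4, 6.7, 4.5],
})

# Row-wise apply versus direct column arithmetic
row_wise = toy.apply(lambda row: row["sepal_length"] - row["petal_length"], axis=1)
vectorized = toy["sepal_length"] - toy["petal_length"]

# Row-wise boolean mask versus direct comparison
mask_apply = toy.apply(lambda row: row["sepal_length"] > 7.6, axis=1)
mask_vec = toy["sepal_length"] > 7.6

print(toy[mask_vec])
```

apply with axis=1 remains useful when the per-row logic cannot be written as column operations.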

iris.iloc[:,:4].apply(lambda col: col + 1)
iris.iloc[:,:4].apply(lambda col: col.apply(lambda x: x*100 if x == 5.1 else x))


#  When the function is an <aggregation> function => agg can replace apply in every case
iris.iloc[:,:4].agg('mean')
iris.iloc[:,:4].agg('mean', axis=1) # row-wise aggregation
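
The relationship between apply and agg described in this section can be checked on a small frame. This is a sketch on hypothetical toy data, not a full comparison of the two methods:

```python
import pandas as pd

# Hypothetical toy data, not the iris dataset
toy = pd.DataFrame({
    "sepal_length": [1.0, 2.0, 3.0],
    "sepal_width": [10.0, 20.0, 30.0],
})

# Aggregation function: apply and agg return the same Series (one scalar per column)
via_apply = toy.apply("mean")
via_agg = toy.agg("mean")
print(via_apply.equals(via_agg))

# Non-aggregating function: the result keeps the original shape, so this is apply territory
shifted = toy.apply(lambda col: col + 1)
print(shifted.shape)
```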
