19 Groupby

GroupBy

  • 데이터베이스는 Group by절을 사용하여 Group으로 나눈후 집합(Group) 함수(count, avg, sum) 등을 사용할 수 있는데 Pandas에서도 이와 비슷한 개념인 Groupby 메소드가 존재함
  • Groupby Docs에 정의를 보면 매퍼에 의해 만들어진 Groupby 객체로 정의 됨(Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

    • grouped = df.groupby['col1'] groupby메소드를 사용하여 groupby 객체가 생성되고.
    • grouped.count(), grouped.mean(), grouped.sum()와 같이 생성된 groupby에서 집합함수를 적용할 수 있음.
  • groupby는 다음과 같은 iteration을 제공함

    • for name, group in df.groupby()
In [2]:
import numpy as np
import pandas as pd

arrays = [['Arizona','Boston','Chicago','Detroit', 'Arizona','Boston','Chicago','Detroit']
         ,['First','First','First','First','Second','Second','Second','Second']]
index = pd.MultiIndex.from_arrays(arrays, names=('City','Season'))
columns = [2003, 2004, 2005]
#df = pd.DataFrame(np.random.randn(8, 3), index=index, columns = columns)
df = pd.DataFrame(np.arange(24).reshape((-1, 3)), index=index, columns = columns)
df
Out[2]:
200320042005
CitySeason
ArizonaFirst012
BostonFirst345
ChicagoFirst678
DetroitFirst91011
ArizonaSecond121314
BostonSecond151617
ChicagoSecond181920
DetroitSecond212223
In [4]:
# column을 기준으로 groupby 
grouped = df.groupby(2004)
#-grouped.count(),
print(grouped.count())
#-grouped.mean(),
print((grouped.mean()))
# grouped.sum()
print(grouped.sum())
      2003  2005
2004            
1        1     1
4        1     1
7        1     1
10       1     1
13       1     1
16       1     1
19       1     1
22       1     1
      2003  2005
2004            
1        0     2
4        3     5
7        6     8
10       9    11
13      12    14
16      15    17
19      18    20
22      21    23
      2003  2005
2004            
1        0     2
4        3     5
7        6     8
10       9    11
13      12    14
16      15    17
19      18    20
22      21    23
In [3]:
# index 중에 하나인 'Season' 기준으로 groupby 
grouped = df.groupby('Season')
print(grouped.count())

(grouped.mean())
        2003  2004  2005
Season                  
First      4     4     4
Second     4     4     4
Out[3]:
200320042005
Season
First4.55.56.5
Second16.517.518.5
In [4]:
# index의  array를  기준으로 groupby 
grouped = df.groupby(['City','Season'])
print(grouped.count())
(grouped.mean())
                2003  2004  2005
City    Season                  
Arizona First      1     1     1
        Second     1     1     1
Boston  First      1     1     1
        Second     1     1     1
Chicago First      1     1     1
        Second     1     1     1
Detroit First      1     1     1
        Second     1     1     1
Out[4]:
200320042005
CitySeason
ArizonaFirst012
Second121314
BostonFirst345
Second151617
ChicagoFirst678
Second181920
DetroitFirst91011
Second212223
In [5]:
# groupby iteration
for name, group in df.groupby('City'):
    print(name)
    print(group)
Arizona
                2003  2004  2005
City    Season                  
Arizona First      0     1     2
        Second    12    13    14
Boston
               2003  2004  2005
City   Season                  
Boston First      3     4     5
       Second    15    16    17
Chicago
                2003  2004  2005
City    Season                  
Chicago First      6     7     8
        Second    18    19    20
Detroit
                2003  2004  2005
City    Season                  
Detroit First      9    10    11
        Second    21    22    23
In [6]:
# groupby iteration2
for (n1 ,n2), group in df.groupby(['City','Season']):
    print(n1, n2)
    print(group)
Arizona First
                2003  2004  2005
City    Season                  
Arizona First      0     1     2
Arizona Second
                2003  2004  2005
City    Season                  
Arizona Second    12    13    14
Boston First
               2003  2004  2005
City   Season                  
Boston First      3     4     5
Boston Second
               2003  2004  2005
City   Season                  
Boston Second    15    16    17
Chicago First
                2003  2004  2005
City    Season                  
Chicago First      6     7     8
Chicago Second
                2003  2004  2005
City    Season                  
Chicago Second    18    19    20
Detroit First
                2003  2004  2005
City    Season                  
Detroit First      9    10    11
Detroit Second
                2003  2004  2005
City    Season                  
Detroit Second    21    22    23
In [7]:
df
Out[7]:
200320042005
CitySeason
ArizonaFirst012
BostonFirst345
ChicagoFirst678
DetroitFirst91011
ArizonaSecond121314
BostonSecond151617
ChicagoSecond181920
DetroitSecond212223
In [8]:
#groupby후 DataFrame을 dictionary에 넣음 
dic_group = dict(list(df.groupby('Season')))
In [9]:
dic_group
Out[9]:
{'First':                 2003  2004  2005
 City    Season                  
 Arizona First      0     1     2
 Boston  First      3     4     5
 Chicago First      6     7     8
 Detroit First      9    10    11, 'Second':                 2003  2004  2005
 City    Season                  
 Arizona Second    12    13    14
 Boston  Second    15    16    17
 Chicago Second    18    19    20
 Detroit Second    21    22    23}
In [10]:
# dictionary에 요소하나인 DataFrame을 조회
print(type(dic_group['First']))
dic_group['Second']
<class 'pandas.core.frame.DataFrame'>
Out[10]:
200320042005
CitySeason
ArizonaSecond121314
BostonSecond151617
ChicagoSecond181920
DetroitSecond212223


© 2017. All rights reserved.

Powered by ZooFighter v0.12