02_Create DataFrame

30 Sep 2017 | 5 Minute Read on Pandas

Create DataFrame

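In Pandas, the data structure that resembles a database table is called a DataFrame.
A DataFrame consists of data, an index, and columns.
- data is a two-dimensional structure, such as a numpy ndarray or a Python dict or list, holding numeric or string values.
- columns work like database columns: each one has its own data type, such as numeric or string.
- index is often omitted; when it is, it defaults to the integers 0 through (number of rows - 1), equivalent to np.arange(n).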
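
The default labels described above can be sketched as follows. This is a minimal illustration, assuming a reasonably recent pandas; the variable names (`values`, `df_default`) are made up for this example. When neither index nor columns is passed, both are numbered from 0.

```python
import numpy as np
import pandas as pd

# 2 rows x 3 columns, with no index or columns supplied
values = np.arange(6).reshape((2, 3))
df_default = pd.DataFrame(values)

# Row labels default to 0 .. (number of rows - 1)
row_labels = list(df_default.index)
# Column labels default to 0 .. (number of columns - 1)
column_labels = list(df_default.columns)

print(row_labels)
print(column_labels)
```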
There are many ways to create a DataFrame, including DataFrame(), read_csv(), and read_excel().
Thinking of index and columns as rows_index and columns_index makes their roles easier to grasp.
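
As a rough sketch of the read_csv() route, the snippet below parses CSV text held in memory rather than a file on disk. The CSV content and the names `csv_text` and `df_csv` are assumptions made up for this illustration; `index_col` tells pandas which column to use as the row index.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path
csv_text = """year,Arizona,Boston
2003,0,1
2004,4,5
2005,8,9
"""

# Use the "year" column as the row index instead of the default 0..n-1
df_csv = pd.read_csv(io.StringIO(csv_text), index_col="year")

print(df_csv)
```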

In [1]:

import pandas as pd
import numpy as np
# data can be expressed as a 3x4 ndarray holding the values 0 through 11
data = np.arange(12).reshape((3, 4))

index = [2003, 2004, 2005]
columns = ['Arizona', 'Boston', 'Chicago', 'Detroit']
df = pd.DataFrame(data=data, index=index, columns=columns)

print(df)

      Arizona  Boston  Chicago  Detroit
2003        0       1        2        3
2004        4       5        6        7
2005        8       9       10       11

In [2]:

# df.info() shows summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 2003 to 2005
Data columns (total 4 columns):
Arizona    3 non-null int64
Boston     3 non-null int64
Chicago    3 non-null int64
Detroit    3 non-null int64
dtypes: int64(4)
memory usage: 120.0 bytes

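According to the Pandas Docs for DataFrame:

In addition to data, index, and columns, the DataFrame constructor also accepts dtype and copy parameters.
dtype explicitly declares the data type; it applies a single type to all columns.
If it is not given, Pandas infers the type of each column on its own.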
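
A minimal sketch of the dtype parameter, reusing the data and column names from the example above. The names `df_float` and `df_mixed` are made up for this illustration. Because dtype forces one type onto every column, the second half shows `astype` as one possible way to change a single column afterwards.

```python
import numpy as np
import pandas as pd

data = np.arange(12).reshape((3, 4))
columns = ['Arizona', 'Boston', 'Chicago', 'Detroit']

# dtype applies one type to every column
df_float = pd.DataFrame(data, columns=columns, dtype='float64')
print(df_float.dtypes)

# To change only one column, convert after construction with astype
df_mixed = pd.DataFrame(data, columns=columns).astype({'Boston': 'float64'})
print(df_mixed.dtypes)
```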

In [3]:

# Check the column names and types
print(df.columns)
print(df.dtypes)

Index(['Arizona', 'Boston', 'Chicago', 'Detroit'], dtype='object')
Arizona    int64
Boston     int64
Chicago    int64
Detroit    int64
dtype: object

In [4]:

# Check the index labels
df.index

Out[4]:

Int64Index([2003, 2004, 2005], dtype='int64')

Creating a DataFrame from a Dictionary

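Instead of declaring data, columns, and index separately, you can build a DataFrame from a dictionary that maps column names to their data.
This is the preferred approach for quickly putting together a small dataset.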

In [5]:

dicts = {'Arizona': [0, 4, 8],
         'Boston': [1, 5, 9],
         'Chicago': [2, 6, 10],
         'Detroit': [3, 7, 11]}
df2 = pd.DataFrame(dicts)

In [6]:

print(df2)

   Arizona  Boston  Chicago  Detroit
0        0       1        2        3
1        4       5        6        7
2        8       9       10       11

In [7]:

# Compare with df, which was built from data, columns, and index
# The data and columns match, but the indexes differ, so the two cannot be compared
# df == df2
# (ValueError: Can only compare identically-labeled DataFrame objects)
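
The failing comparison can be reproduced safely by catching the exception. This is a sketch assuming a recent pandas, where the exact error message may differ slightly from the one quoted above; the names `left`, `right`, `raised`, and `same_values` are made up for this illustration. Comparing the underlying arrays is one way to check the values while ignoring the labels.

```python
import numpy as np
import pandas as pd

columns = ['Arizona', 'Boston', 'Chicago', 'Detroit']
rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

left = pd.DataFrame(rows, index=[2003, 2004, 2005], columns=columns)
right = pd.DataFrame(rows, columns=columns)  # default index 0, 1, 2

# == requires identical row and column labels on both sides
try:
    left == right
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Comparing the raw arrays ignores the index and column labels
same_values = np.array_equal(left.to_numpy(), right.to_numpy())
print(same_values)
```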

In [8]:

# Assign index to df2.index
df2.index = index
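
As an alternative to assigning the index afterwards, the index can also be passed when the DataFrame is built from the dictionary. A minimal sketch, where `df3` is a name made up for this illustration; the length of the index must match the length of each list.

```python
import pandas as pd

index = [2003, 2004, 2005]
dicts = {'Arizona': [0, 4, 8],
         'Boston': [1, 5, 9],
         'Chicago': [2, 6, 10],
         'Detroit': [3, 7, 11]}

# Row labels are supplied at construction time
df3 = pd.DataFrame(dicts, index=index)

print(df3)
```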

In [9]:

# The index, columns, and data now match, so the comparison works
print(df2)
(df == df2)

      Arizona  Boston  Chicago  Detroit
2003        0       1        2        3
2004        4       5        6        7
2005        8       9       10       11

Out[9]:

      Arizona  Boston  Chicago  Detroit
2003     True    True     True     True
2004     True    True     True     True
2005     True    True     True     True
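
When a single True/False answer is more convenient than an element-wise table, the result can be reduced. The sketch below shows two possible approaches, with names (`a`, `b`, `all_equal`, `equals_result`) made up for this illustration. Note that `equals` also requires the column dtypes to match.

```python
import pandas as pd

columns = ['Arizona', 'Boston', 'Chicago', 'Detroit']
rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

a = pd.DataFrame(rows, index=[2003, 2004, 2005], columns=columns)
b = a.copy()

# Reduce the element-wise result: first per column, then across columns
all_equal = bool((a == b).all().all())

# equals() compares labels, values, and dtypes in one call
equals_result = a.equals(b)

print(all_equal, equals_result)
```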