10_String Functions

09 Oct 2017 | 5 Minute Read on Pandas

String Functions

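In pandas, string functions cannot be called directly on a DataFrame; select a column as a Series first, then apply the string function to it.
According to the Series.str docs, the Series.str accessor lets you apply Python-style string methods to a Series or an Index (Vectorized string functions for Series and Index).
Representative methods are as follows.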
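- str.len(): returns the length of each string
- str[]: applies slicing (can be used like SQL's substring)
- str.split(): splits each string on a delimiter
- str.cat(): concatenates strings
- str.get(): returns the element at the given position
- str.replace(): replaces a substring or pattern with another string
- str.contains(): returns a boolean array indicating whether each string contains the given text
- str.find(): returns the position of the search text if it is found, or -1 if it is not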
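
As a minimal sketch of the point about DataFrame versus Series, the following uses a small hypothetical frame (the names `frame`, `a`, and `b` are illustrative and not from the original post). It shows that a DataFrame has no .str accessor while each column does. Applying the method column by column with DataFrame.apply is one way to cover the whole frame.

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration.
frame = pd.DataFrame({"a": ["x_1", "y_22"], "b": ["foo", "ba"]})

# A DataFrame does not expose the .str accessor.
has_str = hasattr(frame, "str")

# Selecting one column yields a Series, which does expose .str.
lengths_a = frame["a"].str.len()

# To cover every column, apply the Series method column by column.
lengths_all = frame.apply(lambda col: col.str.len())

print(has_str)
print(lengths_all)
```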

In [1]:

import pandas as pd
import numpy as np

# Placeholder numeric data; this first frame only supplies the index and column labels
data = np.arange(12).reshape((-1, 4))
#data = np.random.randn(3,4)

index = [2003, 2004, 2005]
columns = ['Arizona', 'Boston', 'Chicago', 'Detroit']
df = pd.DataFrame(data=data, index=index, columns=columns)

In [2]:

# Build 'City_Year' strings: one row per year (index), one entry per city (column)
data = [[city + '_' + str(year) for city in df.columns] for year in df.index]
data

Out[2]:

[['Arizona_2003', 'Boston_2003', 'Chicago_2003', 'Detroit_2003'],
 ['Arizona_2004', 'Boston_2004', 'Chicago_2004', 'Detroit_2004'],
 ['Arizona_2005', 'Boston_2005', 'Chicago_2005', 'Detroit_2005']]

In [3]:

df = pd.DataFrame(data = data, index = index, columns = columns)
df

Out[3]:

	Arizona	Boston	Chicago	Detroit
2003	Arizona_2003	Boston_2003	Chicago_2003	Detroit_2003
2004	Arizona_2004	Boston_2004	Chicago_2004	Detroit_2004
2005	Arizona_2005	Boston_2005	Chicago_2005	Detroit_2005

In [4]:

# str.len(): length of each string
df['Boston'].str.len()

Out[4]:

2003    11
2004    11
2005    11
Name: Boston, dtype: int64

In [5]:

# str[]: slice each string
df['Boston'].str[3:9]

Out[5]:

2003    ton_20
2004    ton_20
2005    ton_20
Name: Boston, dtype: object
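
The bracket form has a method equivalent, Series.str.slice. The sketch below rebuilds the Boston column on its own and also shows how a SQL-style substring (1-based start plus length) could map onto a 0-based slice. The variable names `start_1based` and `length` are assumptions made for illustration.

```python
import pandas as pd

boston = pd.Series(
    ["Boston_2003", "Boston_2004", "Boston_2005"],
    index=[2003, 2004, 2005],
    name="Boston",
)

# Bracket slicing and str.slice() give the same result.
by_bracket = boston.str[3:9]
by_method = boston.str.slice(start=3, stop=9)

# SQL-style SUBSTRING(col, 4, 6): 1-based start 4, length 6
# corresponds to the 0-based slice [3:9].
start_1based, length = 4, 6
sql_like = boston.str.slice(start_1based - 1, start_1based - 1 + length)

print(sql_like)
```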

In [6]:

# Use split to break each string apart on the delimiter '_'
df['Arizona'].str.split('_')

Out[6]:

2003    [Arizona, 2003]
2004    [Arizona, 2004]
2005    [Arizona, 2005]
Name: Arizona, dtype: object
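
When the pieces are wanted as separate columns rather than as lists, str.split accepts expand=True. This is a small sketch on a standalone copy of the Arizona column; the column names `city` and `year` are hypothetical labels chosen here.

```python
import pandas as pd

arizona = pd.Series(
    ["Arizona_2003", "Arizona_2004", "Arizona_2005"],
    index=[2003, 2004, 2005],
    name="Arizona",
)

# expand=True returns a DataFrame with one column per piece.
parts = arizona.str.split("_", expand=True)
parts.columns = ["city", "year"]  # hypothetical column names

# The pieces are still strings; convert the year to an integer if needed.
parts["year"] = parts["year"].astype(int)

print(parts)
```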

In [7]:

# After split, use get to return the element at a given position
df['Arizona'].str.split('_').str.get(1)

Out[7]:

2003    2003
2004    2004
2005    2005
Name: Arizona, dtype: object

In [8]:

# Use cat to concatenate strings
df['Arizona'].str.cat(df['Detroit'])

Out[8]:

2003    Arizona_2003Detroit_2003
2004    Arizona_2004Detroit_2004
2005    Arizona_2005Detroit_2005
Name: Arizona, dtype: object
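
str.cat also takes a sep argument, and when it is called without another Series it joins all values into a single string. The sketch below rebuilds the two columns independently; the separators used are arbitrary choices for illustration.

```python
import pandas as pd

index = [2003, 2004, 2005]
arizona = pd.Series(["Arizona_2003", "Arizona_2004", "Arizona_2005"], index=index)
detroit = pd.Series(["Detroit_2003", "Detroit_2004", "Detroit_2005"], index=index)

# Element-wise concatenation with a separator between the two values.
joined = arizona.str.cat(detroit, sep=" | ")

# With no other Series, cat() collapses the whole Series into one string.
collapsed = arizona.str.cat(sep=",")

print(joined)
print(collapsed)
```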

In [9]:

# Use replace to substitute 'Cups' for '_'
df['Chicago'].str.replace('_', 'Cups')

Out[9]:

2003    ChicagoCups2003
2004    ChicagoCups2004
2005    ChicagoCups2005
Name: Chicago, dtype: object
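
Whether the first argument to str.replace is treated as a regular expression depends on the regex parameter, and its default has differed between pandas versions, so passing it explicitly may be safer. This sketch assumes a pandas version that supports the regex keyword (it is not available in very old releases).

```python
import pandas as pd

chicago = pd.Series(
    ["Chicago_2003", "Chicago_2004", "Chicago_2005"],
    index=[2003, 2004, 2005],
    name="Chicago",
)

# Literal replacement: '_' is treated as plain text.
literal = chicago.str.replace("_", "Cups", regex=False)

# Regex replacement: drop the trailing '_' plus four digits.
no_year = chicago.str.replace(r"_\d{4}$", "", regex=True)

print(literal)
print(no_year)
```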

In [10]:

# Use contains to check whether each string includes '2003'
df['Arizona'].str.contains('2003')

Out[10]:

2003     True
2004    False
2005    False
Name: Arizona, dtype: bool
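
The boolean result of str.contains is typically used as a row filter. The sketch below uses a hypothetical frame named `sample` with one missing value to show na=False, which makes missing entries count as False, and regex=False, which makes the search text literal.

```python
import pandas as pd

# Hypothetical frame with a missing value in the last row.
sample = pd.DataFrame(
    {
        "Arizona": ["Arizona_2003", "Arizona_2004", None],
        "Boston": ["Boston_2003", "Boston_2004", "Boston_2005"],
    },
    index=[2003, 2004, 2005],
)

# na=False treats missing values as "does not contain".
mask = sample["Arizona"].str.contains("2003", regex=False, na=False)

# Use the boolean mask to keep only the matching rows.
rows_2003 = sample[mask]

print(rows_2003)
```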

In [11]:

# find returns the position of the match, or -1 if there is none
df['Arizona'].str.find('2003')

Out[11]:

2003    8
2004   -1
2005   -1
Name: Arizona, dtype: int64
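
Because find returns -1 for a miss, comparing the result with 0 gives a mask of the rows where the text was found. This is a small sketch on a hypothetical Series that includes one value without a delimiter.

```python
import pandas as pd

# Hypothetical values; the last one has no '_' delimiter.
labels = pd.Series(["Arizona_2003", "Boston_2004", "NoYear"])

# Position of the first '_' in each string, or -1 when absent.
pos = labels.str.find("_")

# Keep only the rows where the delimiter was found.
found = labels[pos >= 0]

print(pos)
print(found)
```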