Downcasting


Hippo's data 2025. 1. 23. 15:38


Today we'll look at downcasting!

When you load data and start modeling, the dataset is sometimes large enough that processing gets very slow or you run into an out-of-memory (OOM) error.

One way to deal with this is downcasting, which shrinks the memory footprint of the data!

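In Python (pandas and NumPy), every integer and floating-point dtype takes a fixed number of bytes per value, and pandas defaults to the 8-byte types int64 and float64. By storing each column in the smallest dtype that can still hold its values, you can reduce the total size of the data. For integers this is lossless as long as every value fits in the range of the new type. For floats, a narrower type keeps fewer significant digits, so values may be rounded slightly.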

e.g. downcast int64 to int32, int16, or int8

downcast float64 to float32 or float16
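As a small illustration of the examples above, the sketch below lets pandas pick the narrower dtype through `pd.to_numeric(..., downcast=...)`. The sample values are made up for the demonstration.

```python
import pandas as pd

ints = pd.Series([1, 200, 30_000])       # stored as int64 by default
floats = pd.Series([0.5, 1.25, 3.0])     # stored as float64 by default

# downcast="integer" picks the smallest signed integer type that fits the values
small_ints = pd.to_numeric(ints, downcast="integer")
# downcast="float" goes no lower than float32
small_floats = pd.to_numeric(floats, downcast="float")

print(ints.dtype, "->", small_ints.dtype)       # int64 -> int16
print(floats.dtype, "->", small_floats.dtype)   # float64 -> float32
```

The largest value, 30,000, does not fit in int8 (max 127) but does fit in int16 (max 32,767), so int16 is chosen.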

# Advantages

The data takes up less memory, and processing often gets faster because there are fewer bytes to move around
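To make the memory saving concrete, here is a minimal sketch that stores the same one million small integers as int64 and as int8 and compares the bytes used. The data is synthetic, and the sketch measures only memory, not speed.

```python
import numpy as np
import pandas as pd

values = np.arange(1_000_000) % 100            # one million integers in 0..99
as_int64 = pd.Series(values, dtype="int64")    # 8 bytes per value
as_int8 = as_int64.astype("int8")              # 1 byte per value, same values

bytes_int64 = as_int64.memory_usage(index=False)
bytes_int8 = as_int8.memory_usage(index=False)

print(bytes_int64, bytes_int8)                 # 8000000 1000000
```

Because every value lies between 0 and 99, the int8 copy holds exactly the same numbers in one eighth of the space.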

ver1

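import pandas as pd

def downcast(df, verbose=True):
    # Downcasts the columns of df in place and returns the same DataFrame.
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        dtype_name = df[col].dtype.name
        if dtype_name == 'object':
            pass
        elif dtype_name == 'category':  # Use 'category' for categorical dtype
            pass
        elif dtype_name == 'bool':
            # bool already takes 1 byte per value; int8 only makes the column numeric
            df[col] = df[col].astype('int8')
        elif dtype_name.startswith('int') or (dtype_name.startswith('float') and (df[col] % 1 == 0).all()):
            # integers, and float columns that hold only whole numbers (no NaN)
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif dtype_name.startswith('float'):
            df[col] = pd.to_numeric(df[col], downcast='float')
        # any other dtype (datetime, unsigned int, ...) is left unchanged
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('{:.1f}% compressed'.format(100 * (start_mem - end_mem) / start_mem))
    return df

downcast(train)  # train: the DataFrame loaded earlier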
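Since ver1 modifies the DataFrame it receives, it can be worth keeping a copy and checking afterwards that no values changed. The helper below, `values_preserved`, is a hypothetical name introduced for this sketch and is not part of pandas. The float32 conversion is forced with `astype` only to show what a lossy case looks like.

```python
import pandas as pd

def values_preserved(before, after):
    """Return True if every numeric column holds identical values in both frames."""
    for col in before.select_dtypes(include="number").columns:
        both_nan = before[col].isna() & after[col].isna()
        if not ((before[col] == after[col]) | both_nan).all():
            return False
    return True

df = pd.DataFrame({"qty": [1, 2, 3], "price": [0.1, 0.25, 2.5]})

ints_narrowed = df.assign(qty=df["qty"].astype("int8"))
floats_narrowed = df.assign(price=df["price"].astype("float32"))

print(values_preserved(df, ints_narrowed))     # True
print(values_preserved(df, floats_narrowed))   # False: 0.1 is rounded in float32
```

In practice this would be called as `values_preserved(original.copy(), downcast(original))`, with the copy taken before the downcast runs.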

ver2

import numpy as np

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == object:
            df[col] = df[col].astype('category')
        elif str(col_type)[:3] == 'int':
            c_min = df[col].min()
            c_max = df[col].max()
            # >= and <= so that boundary values such as -128 and 127 still fit
            if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
        elif str(col_type)[:5] == 'float':
            c_min = df[col].min()
            c_max = df[col].max()
            # only the range is checked here; float16 and float32 keep fewer digits,
            # so values can be rounded even when they fit
            if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
        # bool, datetime, and other dtypes are left unchanged
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

train = reduce_mem_usage(train)  # train: the DataFrame loaded earlier

Reference: https://www.kaggle.com/code/gemartin/load-data-reduce-memory-usage (load data (reduce memory usage), a Kaggle notebook using data from Home Credit Default Risk)
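The object to category conversion that ver2 applies can be illustrated on its own. This is a rough sketch with made-up city names. It passes `deep=True` so that the strings themselves are counted, since plain `memory_usage()` counts only the 8-byte pointers of an object column.

```python
import pandas as pd

# 300,000 strings, but only 3 distinct values
cities = pd.Series(["Seoul", "Busan", "Incheon"] * 100_000, dtype="object")
as_category = cities.astype("category")

object_bytes = cities.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(as_category.cat.codes.dtype)      # int8: each row stores a small integer code
print(category_bytes < object_bytes)    # True
```

The saving comes from storing each distinct string once and a small integer code per row, so it is largest for columns with few distinct values. A column where almost every value is unique may gain little or even grow.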

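ver1: downcasts int and float columns, and converts bool to int8

ver2: downcasts int and float columns, and converts object columns to category

ver2 can save more memory, but it can shrink the types so aggressively that later calculations go wrong.

e.g. float16 has a very small range (max 65504) and only about 3 significant decimal digits, so once a column has been downcast to float16, later operations can overflow or underflow and give wrong results (for example, a mean that comes out as inf).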
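The float16 problem is easy to reproduce with NumPy alone. In this minimal sketch with made-up values, every value fits in float16, but their sum does not.

```python
import numpy as np

values = np.full(1000, 1000.0)              # float64; the true sum is 1,000,000
as_f16 = values.astype(np.float16)          # each value fits: float16 max is 65504

total_f16 = as_f16.sum()                    # result is float16, so 1,000,000 overflows
total_f64 = as_f16.sum(dtype=np.float64)    # accumulate in float64 instead

print(total_f16, total_f64)                 # inf 1000000.0
print(np.float16(2049.0))                   # 2048.0: float16 cannot represent 2049
```

One way around this, if float16 storage is still wanted, is to request a wider accumulator with `dtype=np.float64` as above. Stopping at float32 when downcasting avoids most of these surprises.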
