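These are my notes on the code I followed along with and the key points from the lecture "E-commerce Data Analysis with Python".
- Source: fast campus
Chapter03. Ad Response Rate Prediction (Logistic Regression)¶
Purpose of the Analysis¶
Logistic Regression is a model built on top of Linear Regression. The differences are:
- Linear Regression is an algorithm that predicts a value somewhere on a continuous range of numbers (annual spending, for example).
- Logistic Regression is a machine learning algorithm for binary classification: it predicts which of two classes, Yes or No, a sample belongs to.
The data we will work with is advertising data. The y value is whether or not the user clicked the ad, and the input data will be features such as gender and age.
Loading the Data¶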
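To make the difference concrete, here is a minimal sketch of how a logistic model turns a linear score into a probability with the sigmoid function and then thresholds it at 0.5. The weights, bias, and input values are made-up numbers for illustration only, not values from this dataset.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, bias, and input, chosen only for illustration.
weights = [0.8, -0.5]
bias = 0.1
x = [2.0, 1.0]

# Linear part: the weighted sum that Linear Regression would output directly.
score = sum(w * xi for w, xi in zip(weights, x)) + bias

# Logistic part: squash the score into a probability, then pick a class.
probability = sigmoid(score)
label = 1 if probability >= 0.5 else 0
print(score, probability, label)
```

The only change from Linear Regression in this sketch is the last step: the raw score is passed through the sigmoid and compared with a threshold instead of being reported as the prediction.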
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.read_csv('./data/advertising.csv')
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
The .read_csv command only reads the data. So far we have just loaded it without keeping it, so let's add data = in front and store it under the name 'data'.
data = pd.read_csv('./data/advertising.csv')
data
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
Next, let's take a look at the data using a function for inspecting it.
data.head(10)
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
5 | 59.99 | 23.0 | 59761.56 | 226.74 | Sharable client-driven software | Jamieberg | 1 | Norway | 5/19/2016 14:30 | 0 |
6 | 88.91 | NaN | 53852.85 | 208.36 | Enhanced dedicated support | Brandonstad | 0 | Myanmar | 1/28/2016 20:59 | 0 |
7 | 66.00 | 48.0 | 24593.33 | 131.76 | Reactive local challenge | Port Jefferybury | 1 | Australia | 3/7/2016 1:40 | 1 |
8 | 74.53 | 30.0 | 68862.00 | 221.51 | Configurable coherent function | West Colin | 1 | Grenada | 4/18/2016 9:33 | 0 |
9 | 69.88 | 20.0 | 55642.32 | 183.82 | Mandatory homogeneous architecture | Ramirezton | 1 | Ghana | 7/11/2016 1:42 | 0 |
Checking the Feature (Column) Domains¶
- Clicked on Ad is the dependent variable we want to predict: whether this person clicked the ad or not ['0' did not click, '1' clicked]
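- Daily Time Spent on Site : how much time the person spent on this site
- Age : age (NaN: missing value)
- Area Income : income of the area (individual income is hard to obtain unless you have financial-sector data.)
- Daily Internet Usage : how much the person uses the internet per day
- Ad Topic Line : description of the ad (judged unlikely to matter for predicting the dependent variable.)
- City : city
- Male : gender ['0' female, '1' male]. This column has already been preprocessed once.
- Country : country
- Timestamp : time-related data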
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       916 non-null    float64
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object
 5   City                      1000 non-null   object
 6   Male                      1000 non-null   int64
 7   Country                   1000 non-null   object
 8   Timestamp                 1000 non-null   object
 9   Clicked on Ad             1000 non-null   int64
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB
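We can see that Age has null values (missing values): only 916 of the 1000 rows are non-null. float64 is a number with a decimal point, int64 is a number without a decimal point, and object is text.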
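One detail worth noting is why Age is float64 even though ages are whole numbers: a missing value is stored as NaN, which is a floating-point value. The sketch below reproduces this on a hypothetical three-row frame (the rows are invented for illustration).

```python
import pandas as pd

# Hypothetical three-row frame that mimics a few columns of the ad data.
toy = pd.DataFrame({
    "Age": [31.0, None, 26.0],   # the missing value is stored as NaN -> float64
    "Male": [1, 0, 1],           # whole numbers with no gaps -> int64
    "City": ["West Jodi", "Davidton", "Jamieberg"],  # text column
})

age_dtype = str(toy["Age"].dtype)
male_dtype = str(toy["Male"].dtype)
print(toy.dtypes)
```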
data.describe()
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad | |
---|---|---|---|---|---|---|
count | 1000.000000 | 916.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 |
mean | 65.000200 | 36.128821 | 55000.000080 | 180.000100 | 0.481000 | 0.50000 |
std | 15.853615 | 9.018548 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
min | 32.600000 | 19.000000 | 13996.500000 | 104.780000 | 0.000000 | 0.00000 |
25% | 51.360000 | 29.000000 | 47031.802500 | 138.830000 | 0.000000 | 0.00000 |
50% | 68.215000 | 35.000000 | 57012.300000 | 183.130000 | 0.000000 | 0.50000 |
75% | 78.547500 | 42.000000 | 65470.635000 | 218.792500 | 1.000000 | 1.00000 |
max | 91.430000 | 61.000000 | 79484.800000 | 269.960000 | 1.000000 | 1.00000 |
Let's select a single column by indexing.
data['Area Income']
0      61833.90
1      68441.85
2      59785.94
3      54806.18
4      73889.99
         ...
995    71384.57
996    67782.17
997    42415.72
998    41920.79
999    29875.80
Name: Area Income, Length: 1000, dtype: float64
Let's visualize the column we just selected.
sns.distplot(data['Area Income'])
/home/ubuntu/.local/lib/python3.6/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='Area Income', ylabel='Density'>
sns.distplot(data['Age'])
/home/ubuntu/.local/lib/python3.6/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='Age', ylabel='Density'>
Let's check how many distinct values Country has.
data['Country'].nunique()
237
There are 237 countries.
data['City'].nunique()
969
There are 969 cities. We can read this as: among the 1000 rows, almost no city is repeated.
data['Ad Topic Line'].nunique()
1000
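We got 1000 out of 1000: every value is unique. In this walkthrough we will drop this column and move on. This is not the only correct approach.
Let's check for missing values with a function that asks whether each value is missing.¶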
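The same check can be run on all columns at once. The following is a small sketch on a hypothetical four-row frame; the 0.9 cutoff is an arbitrary rule of thumb picked for this example, not something from the lecture.

```python
import pandas as pd

# Hypothetical frame: Country repeats, Ad Topic Line never does.
toy = pd.DataFrame({
    "Country": ["France", "France", "Peru", "Peru"],
    "Ad Topic Line": ["a", "b", "c", "d"],
})

# Share of distinct values per column: 1.0 means every row is different.
unique_ratio = toy.nunique() / len(toy)

# Assumed cutoff: flag columns where (almost) every value is unique.
high_cardinality = unique_ratio[unique_ratio > 0.9].index.tolist()
print(high_cardinality)
```

A column where nearly every value is unique behaves like an ID, so a model cannot learn a repeatable pattern from it in its raw form.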
data.isna()
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | True | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | False | False | False | False | False | False | False | False | False | False |
996 | False | False | False | False | False | False | False | False | False | False |
997 | False | False | False | False | False | False | False | False | False | False |
998 | False | False | False | False | False | False | False | False | False | False |
999 | False | False | False | False | False | False | False | False | False | False |
1000 rows × 10 columns
The result is boolean, that is, values in Yes/No form. It is hard to read anything off this output directly, so let's apply another function to turn it into the information we want.
- False = 0
- True = 1
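This mapping is built into Python itself: True and False behave as 1 and 0 in arithmetic. A minimal sketch with a made-up list of flags:

```python
# Hypothetical answers to "is this value missing?" for five rows.
flags = [False, True, False, True, True]

# Summing booleans counts the True values, because True == 1 and False == 0.
missing_count = sum(flags)
missing_ratio = missing_count / len(flags)
print(missing_count, missing_ratio)
```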
data
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
data.sum()
Daily Time Spent on Site    65000.2
Age                         33094
Area Income                 5.5e+07
Daily Internet Usage        180000
Ad Topic Line               Cloned 5thgeneration orchestrationMonitored na...
City                        WrightburghWest JodiDavidtonWest TerrifurtSout...
Male                        481
Country                     TunisiaNauruSan MarinoItalyIcelandNorwayMyanma...
Timestamp                   3/27/2016 0:534/4/2016 1:393/13/2016 20:351/10...
Clicked on Ad               500
dtype: object
.sum() returns the sum of each column (for text columns, the strings are simply concatenated). Let's apply this by appending .sum() to data.isna().
data.isna().sum()
Daily Time Spent on Site     0
Age                         84
Area Income                  0
Daily Internet Usage         0
Ad Topic Line                0
City                         0
Male                         0
Country                      0
Timestamp                    0
Clicked on Ad                0
dtype: int64
Because False counts as 0 and True as 1, adding them up gives the result above.
- As a result, we can see that Age has 84 missing values.
len(data)
1000
data.isna().sum() / len(data)
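Daily Time Spent on Site    0.000
Age                         0.084
Area Income                 0.000
Daily Internet Usage        0.000
Ad Topic Line               0.000
City                        0.000
Male                        0.000
Country                     0.000
Timestamp                   0.000
Clicked on Ad               0.000
dtype: float64
This is the number of missing values divided by the total of 1000 rows, that is, the share of missing values per column (0.084 means 8.4% of Age is missing).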
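As a side note, the same ratio can be obtained in one step, because the mean of a boolean column is the share of True values. The sketch below checks this on a hypothetical four-row frame and converts the ratio to a percentage.

```python
import pandas as pd

# Hypothetical frame with two missing ages out of four rows.
toy = pd.DataFrame({"Age": [31.0, None, 26.0, None], "Male": [1, 0, 1, 0]})

ratio_a = toy.isna().sum() / len(toy)   # count of missing values / row count
ratio_b = toy.isna().mean()             # mean of True/False gives the same share

# Express the share as a percentage rounded to one decimal place.
percent = (ratio_b * 100).round(1)
print(percent)
```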
data.dropna()
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
5 | 59.99 | 23.0 | 59761.56 | 226.74 | Sharable client-driven software | Jamieberg | 1 | Norway | 5/19/2016 14:30 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
916 rows × 10 columns
data
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
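Because we did not update data, calling data still shows 1000 rows. The way to update it is as follows.
- data.dropna(inplace=True) We will not use the row-deletion approach here, so we will not apply this update. Deleting rows can also throw away the other, potentially important, information in those rows.
- How to remove a column (feature): .drop()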
data.drop('Age', axis=1)
data
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
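The Age column is still there because .drop(), like .dropna(), returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch on a hypothetical two-row frame (the name trimmed is introduced only for this example):

```python
import pandas as pd

toy = pd.DataFrame({"Age": [31.0, None], "Male": [1, 0]})

# The returned frame is discarded here, so toy itself is unchanged.
toy.drop("Age", axis=1)
cols_before = list(toy.columns)

# Keeping the returned frame under a name is what makes the change stick.
trimmed = toy.drop("Age", axis=1)
cols_after = list(trimmed.columns)
print(cols_before, cols_after)
```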
- How to fill in missing values: use the mean
data['Age'].mean()
36.12882096069869
data['Age'].median()
35.0
data.fillna(data['Age'].mean())
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | 36.128821 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.000000 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.000000 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.000000 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.000000 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.000000 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.000000 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.000000 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.000000 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.000000 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
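Note that calling fillna with a single number applies it to every column that has gaps. Here only Age has missing values, so the result is the same, but if other columns had gaps they would also receive the mean age. The sketch below, on a hypothetical frame with a gap in a second column, shows how passing a dict limits the fill to one column.

```python
import pandas as pd

# Hypothetical frame with a gap in Age and a gap in Area Income.
toy = pd.DataFrame({
    "Age": [30.0, None, 40.0],
    "Area Income": [50000.0, 60000.0, None],
})
age_mean = toy["Age"].mean()  # 35.0

# A scalar is applied to every column, so the income gap also becomes 35.0.
filled_all = toy.fillna(age_mean)

# A dict restricts the fill to the named column; other gaps stay NaN.
filled_age_only = toy.fillna({"Age": age_mean})
print(filled_age_only)
```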
The fractional part makes the ages look untidy. Let's remove the decimals with round().
round(3.3) # round() example
3
round(data['Age'].mean())
36
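One thing to keep in mind about the built-in round(): exact halves are rounded to the nearest even integer, not always up. This does not affect the mean age here, but it can be surprising elsewhere. A short sketch:

```python
# Ordinary cases round to the nearest integer.
ordinary = [round(3.3), round(36.12882096069869)]

# Exact halves go to the nearest even integer ("banker's rounding").
halves = [round(0.5), round(1.5), round(2.5), round(3.5)]

# A second argument keeps that many decimal places instead.
one_decimal = round(36.12882096069869, 1)
print(ordinary, halves, one_decimal)
```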
data.fillna(round(data['Age'].mean()))
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | 36.0 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
data
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
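When we call data, the missing values have not been filled with the result of data.fillna(round(data['Age'].mean())) above. They still show up as NaN. Let's update data by assigning the result back.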
data = data.fillna(round(data['Age'].mean()))
data
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | 36.0 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
- We have now filled the missing values with the mean and updated data. Let's use data.isna().sum() to check that the missing values in Age have really all been filled.
data.isna().sum()
Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               0
dtype: int64
Every column shows 0, so all missing values have been filled.
from sklearn.model_selection import train_test_split
X = data[['Daily Time Spent on Site','Age','Area Income','Daily Internet Usage','Male']] # independent variables
y = data['Clicked on Ad'] # dependent variable we want to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100) # train : test ratio of 8:2
X_train
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | |
---|---|---|---|---|---|
675 | 82.58 | 38.0 | 65496.78 | 225.23 | 1 |
358 | 51.38 | 59.0 | 42362.49 | 158.56 | 0 |
159 | 75.55 | 36.0 | 73234.87 | 159.24 | 0 |
533 | 91.43 | 36.0 | 46964.11 | 209.91 | 1 |
678 | 87.85 | 34.0 | 51816.27 | 153.01 | 0 |
... | ... | ... | ... | ... | ... |
855 | 50.87 | 24.0 | 62939.50 | 190.41 | 0 |
871 | 76.79 | 27.0 | 55677.12 | 235.94 | 0 |
835 | 63.11 | 34.0 | 63107.88 | 254.94 | 1 |
792 | 56.56 | 26.0 | 68783.45 | 204.47 | 1 |
520 | 46.61 | 42.0 | 65856.74 | 136.18 | 0 |
800 rows × 5 columns
X_test
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | |
---|---|---|---|---|---|
249 | 62.20 | 25.0 | 25408.21 | 161.16 | 0 |
353 | 79.54 | 44.0 | 70492.60 | 217.68 | 1 |
537 | 61.72 | 26.0 | 67279.06 | 218.49 | 0 |
424 | 43.59 | 36.0 | 58849.77 | 132.31 | 1 |
564 | 64.75 | 36.0 | 63001.03 | 117.66 | 0 |
... | ... | ... | ... | ... | ... |
684 | 42.06 | 34.0 | 43241.19 | 131.55 | 0 |
644 | 78.35 | 46.0 | 53185.34 | 253.48 | 0 |
110 | 66.63 | 60.0 | 60333.38 | 176.98 | 0 |
28 | 70.20 | 34.0 | 32708.94 | 119.20 | 0 |
804 | 53.92 | 41.0 | 25739.09 | 125.46 | 1 |
200 rows × 5 columns
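The shuffled row indices in the outputs above (675, 358, ... and 249, 353, ...) show that the split is random rather than a cut at row 800. As a conceptual sketch only, and not scikit-learn's actual implementation, a seeded 80/20 hold-out split can be written in plain Python like this (holdout_split is a hypothetical helper, and it will not reproduce the same row order as train_test_split):

```python
import random

def holdout_split(rows, test_size=0.2, seed=100):
    """Shuffle row indices reproducibly and take the first test_size share as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # same seed -> same shuffle every run
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(1000))                       # stand-in for the 1000 data rows
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as random_state: rerunning the notebook gives the same train and test rows, so scores stay comparable between runs.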
Now that the train and test data are ready, let's move on to the modeling part.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression() # create a LogisticRegression instance and bind it to the name model
model.fit(X_train, y_train)
LogisticRegression()
model.coef_
array([[-6.64737762e-02, 2.66015818e-01, -1.15501902e-05, -2.44285539e-02, 2.00758165e-03]])
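The coefficients come out in the same order as the columns of X. One common way to read them is as odds ratios: exp(coefficient) is the factor by which the odds of a click change for a one-unit increase in that feature, with the other features held fixed. The sketch below uses the values copied from the output above. Because the features were not scaled, the magnitudes are not directly comparable across features, so this is a rough reading aid rather than a ranking of importance.

```python
import math

# Feature order taken from the columns selected for X.
feature_names = ['Daily Time Spent on Site', 'Age', 'Area Income',
                 'Daily Internet Usage', 'Male']

# Coefficients copied from the model.coef_ output above.
coefficients = [-6.64737762e-02, 2.66015818e-01, -1.15501902e-05,
                -2.44285539e-02, 2.00758165e-03]

# exp(coefficient) > 1 raises the odds of a click, < 1 lowers them.
odds_ratios = {name: math.exp(c) for name, c in zip(feature_names, coefficients)}
for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Read this way, each extra year of age multiplies the odds of a click by roughly 1.3, while more time on the site and more daily internet usage lower them.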
pred = model.predict(X_test) # pass in X_test, the data we set aside for evaluation
pred
array([0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
y_test
249    1
353    0
537    0
424    1
564    1
      ..
684    1
644    0
110    1
28     1
804    1
Name: Clicked on Ad, Length: 200, dtype: int64
We could compare pred and y_test by eye to see how many predictions are right and wrong, but there is a function that does this for us as well.
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test, pred)
0.9
The accuracy is 0.9: the model classified 90% of the 200 test rows correctly, which is a good score.
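Accuracy is simply the share of predictions that match the true labels. As a small illustration, here is the calculation done by hand on the first five values of pred and y_test shown above (five rows only, so the result differs from the full-test-set score):

```python
# First five true labels (rows 249, 353, 537, 424, 564) and predictions.
y_true_head = [1, 0, 0, 1, 1]
pred_head = [0, 1, 0, 1, 1]

# Count positions where prediction and truth agree, then divide by the total.
matches = sum(t == p for t, p in zip(y_true_head, pred_head))
accuracy_head = matches / len(y_true_head)
print(matches, accuracy_head)
```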
confusion_matrix(y_test, pred)
array([[92, 8], [12, 88]])
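In scikit-learn's confusion_matrix, rows are the true classes and columns are the predicted classes, so the matrix above reads as [[TN, FP], [FN, TP]]. The sketch below derives accuracy, precision, and recall from those four numbers.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = [[92, 8],
      [12, 88]]
tn, fp = cm[0]   # true class 0: predicted 0 / predicted 1
fn, tp = cm[1]   # true class 1: predicted 0 / predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are right
precision = tp / (tp + fp)                   # of predicted clicks, how many were real
recall = tp / (tp + fn)                      # of real clicks, how many were found
print(accuracy, round(precision, 4), recall)
```

The accuracy computed here matches the 0.9 reported by accuracy_score, while the matrix adds the breakdown: 8 non-clickers were predicted as clickers, and 12 clickers were missed.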
Python Skill Tip¶
Working with Columns (Features)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('./data/advertising.csv')
data
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
1000 rows × 10 columns
data['Country'].nunique()
237
data['Country'].unique()
array(['Tunisia', 'Nauru', 'San Marino', 'Italy', 'Iceland', 'Norway', 'Myanmar', 'Australia', 'Grenada', 'Ghana', 'Qatar', 'Burundi', 'Egypt', 'Bosnia and Herzegovina', 'Barbados', 'Spain', 'Palestinian Territory', 'Afghanistan', 'British Indian Ocean Territory (Chagos Archipelago)', 'Russian Federation', 'Cameroon', 'Korea', 'Tokelau', 'Monaco', 'Tuvalu', 'Greece', 'British Virgin Islands', 'Bouvet Island (Bouvetoya)', 'Peru', 'Aruba', 'Maldives', 'Senegal', 'Dominica', 'Luxembourg', 'Montenegro', 'Ukraine', 'Saint Helena', 'Liberia', 'Turkmenistan', 'Niger', 'Sri Lanka', 'Trinidad and Tobago', 'United Kingdom', 'Guinea-Bissau', 'Micronesia', 'Turkey', 'Croatia', 'Israel', 'Svalbard & Jan Mayen Islands', 'Azerbaijan', 'Iran', 'Saint Vincent and the Grenadines', 'Bulgaria', 'Christmas Island', 'Canada', 'Rwanda', 'Turks and Caicos Islands', 'Norfolk Island', 'Cook Islands', 'Guatemala', "Cote d'Ivoire", 'Faroe Islands', 'Ireland', 'Moldova', 'Nicaragua', 'Montserrat', 'Timor-Leste', 'Puerto Rico', 'Central African Republic', 'Venezuela', 'Wallis and Futuna', 'Jersey', 'Samoa', 'Antarctica (the territory South of 60 deg S)', 'Albania', 'Hong Kong', 'Lithuania', 'Bangladesh', 'Western Sahara', 'Serbia', 'Czech Republic', 'Guernsey', 'Tanzania', 'Bhutan', 'Guinea', 'Madagascar', 'Lebanon', 'Eritrea', 'Guyana', 'United Arab Emirates', 'Martinique', 'Somalia', 'Benin', 'Papua New Guinea', 'Uzbekistan', 'South Africa', 'Hungary', 'Falkland Islands (Malvinas)', 'Saint Martin', 'Cuba', 'United States Minor Outlying Islands', 'Belize', 'Kuwait', 'Thailand', 'Gibraltar', 'Holy See (Vatican City State)', 'Netherlands', 'Belarus', 'New Zealand', 'Togo', 'Kenya', 'Palau', 'Cambodia', 'Costa Rica', 'Liechtenstein', 'Angola', 'Equatorial Guinea', 'Mongolia', 'Brazil', 'Chad', 'Portugal', 'Malawi', 'Singapore', 'Kazakhstan', 'China', 'Vietnam', 'Mayotte', 'Jamaica', 'Bahamas', 'Algeria', 'Fiji', 'Argentina', 'Philippines', 'Suriname', 'Guam', 'Antigua and Barbuda', 'Georgia', 
'Jordan', 'Saudi Arabia', 'Sao Tome and Principe', 'Cyprus', 'Kyrgyz Republic', 'Pakistan', 'Seychelles', 'Mauritania', 'Chile', 'Poland', 'Estonia', 'Latvia', 'Bahrain', 'Colombia', 'Brunei Darussalam', 'Taiwan', 'Saint Pierre and Miquelon', 'Finland', 'French Southern Territories', 'Sierra Leone', 'Tajikistan', 'Ecuador', 'Switzerland', 'France', 'Malaysia', 'Mauritius', 'Japan', 'Greenland', 'Guadeloupe', 'Belgium', 'Honduras', 'Paraguay', 'French Guiana', 'Northern Mariana Islands', 'American Samoa', 'Austria', 'Tonga', 'New Caledonia', 'United States of America', 'Morocco', 'Macedonia', 'Gabon', 'Uganda', 'Saint Lucia', 'Niue', 'Zambia', 'Congo', 'Pitcairn Islands', 'Anguilla', 'Sweden', 'Indonesia', 'Mexico', 'Haiti', 'Gambia', 'El Salvador', 'Libyan Arab Jamahiriya', 'Saint Barthelemy', 'Reunion', 'Panama', 'Dominican Republic', 'Zimbabwe', 'Swaziland', 'Saint Kitts and Nevis', 'Burkina Faso', 'Heard Island and McDonald Islands', 'Bolivia', 'Netherlands Antilles', 'French Polynesia', 'Germany', 'Malta', 'Sudan', "Lao People's Democratic Republic", 'Isle of Man', 'Macao', 'United States Virgin Islands', 'Djibouti', 'Mali', 'Romania', 'Cayman Islands', 'Ethiopia', 'Uruguay', 'Comoros', 'Vanuatu', 'Nepal', 'Yemen', 'India', 'Cape Verde', 'Slovenia', 'Denmark', 'Syrian Arab Republic', 'Andorra', 'Namibia', 'Slovakia (Slovak Republic)', 'Armenia', 'South Georgia and the South Sandwich Islands', 'Kiribati', 'Marshall Islands', 'Bermuda', 'Mozambique', 'Lesotho'], dtype=object)
value_counts() shows how many times each value occurs, sorted from most to least frequent.
data['Country'].value_counts()
France                   9
Czech Republic           9
Australia                8
Turkey                   8
South Africa             8
                        ..
Aruba                    1
Saint Kitts and Nevis    1
Cape Verde               1
Montserrat               1
Bermuda                  1
Name: Country, Length: 237, dtype: int64
data['Country'].value_counts().head(30)
France                    9
Czech Republic            9
Australia                 8
Turkey                    8
South Africa              8
Micronesia                8
Afghanistan               8
Cyprus                    8
Liberia                   8
Peru                      8
Senegal                   8
Greece                    8
Taiwan                    7
Bahamas                   7
Burundi                   7
Ethiopia                  7
Eritrea                   7
Cambodia                  7
Albania                   7
Venezuela                 7
Western Sahara            7
Fiji                      7
Luxembourg                7
Bosnia and Herzegovina    7
Zimbabwe                  6
Mongolia                  6
Hungary                   6
Belarus                   6
Algeria                   6
Qatar                     6
Name: Country, dtype: int64
data['Country'].value_counts().tail(30)
Comoros                                                2
Reunion                                                2
Pitcairn Islands                                       2
Sao Tome and Principe                                  2
Andorra                                                2
Djibouti                                               2
South Georgia and the South Sandwich Islands           2
Mauritania                                             2
Slovakia (Slovak Republic)                             2
Norway                                                 2
Bhutan                                                 2
Benin                                                  2
Central African Republic                               2
Uzbekistan                                             2
Haiti                                                  2
Guinea-Bissau                                          2
Lesotho                                                1
Slovenia                                               1
Mozambique                                             1
Romania                                                1
Kiribati                                               1
Germany                                                1
Marshall Islands                                       1
Jordan                                                 1
British Indian Ocean Territory (Chagos Archipelago)    1
Aruba                                                  1
Saint Kitts and Nevis                                  1
Cape Verde                                             1
Montserrat                                             1
Bermuda                                                1
Name: Country, dtype: int64
- Source: fast campus, E-commerce Data Analysis with Python
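When the raw counts are hard to compare, value_counts can also return shares instead of counts through its normalize parameter. A small sketch on a hypothetical six-row Series:

```python
import pandas as pd

# Hypothetical country column with repeats.
countries = pd.Series(["France", "Peru", "France", "Fiji", "France", "Peru"])

counts = countries.value_counts()                 # sorted from most to least frequent
shares = countries.value_counts(normalize=True)   # same order, as fractions of all rows

top_country = counts.index[0]
print(counts)
print(shares)
```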