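Chapter03. Logistic Regression

These are my notes on the code I followed along with, and the key points, from the course "E-commerce Data Analysis with Python" (파이썬을 활용한 이커머스 데이터분석).

Source: fast campus

Chapter03. Ad Response Rate Prediction (Logistic Regression)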

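Purpose of the Analysis

Logistic Regression is a model built on top of Linear Regression. The difference is as follows:

Linear Regression is an algorithm that predicts a value somewhere on a continuous numeric range (annual spending, for example).
Logistic Regression is a machine learning algorithm for binary classification: it predicts which of two classes, Yes or No, a sample belongs to.

The data we will work with is advertising data. The y value is whether the user clicked the ad, and the input data consists of features such as sex and age.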
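To make the contrast with Linear Regression concrete, here is a minimal sketch of the logistic (sigmoid) function that Logistic Regression applies on top of a linear score. The weights, bias, and inputs are made-up numbers for illustration only; they are not fitted to this dataset.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias, chosen only for illustration.
weights = [0.8, -0.5]
bias = 0.1

def predict_proba(features):
    # Linear Regression would stop at this weighted sum.
    score = bias + sum(w * x for w, x in zip(weights, features))
    # Logistic Regression squashes the score into a probability.
    return sigmoid(score)

def predict_label(features, threshold=0.5):
    # Classify as 1 (Yes) when the probability reaches the threshold.
    return 1 if predict_proba(features) >= threshold else 0

print(predict_proba([1.0, 2.0]))
print(predict_label([1.0, 2.0]))
```

The 0.5 threshold is the usual default, but it is a choice rather than a fixed part of the model.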

Loading the Data

In [1]:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

pd.read_csv('./data/advertising.csv')

Out[2]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	NaN	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns

Calling .read_csv by itself only loads the data and displays it; nothing is stored. So let's add data = in front and give it the name 'data'.

In [3]:

data = pd.read_csv('./data/advertising.csv')

In [4]:

data

Out[4]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	NaN	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns

Next, let's look at the data with a function meant for inspecting it. head(10) shows the first 10 rows.

In [5]:

data.head(10)

Out[5]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	NaN	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
5	59.99	23.0	59761.56	226.74	Sharable client-driven software	Jamieberg	1	Norway	5/19/2016 14:30	0
6	88.91	NaN	53852.85	208.36	Enhanced dedicated support	Brandonstad	0	Myanmar	1/28/2016 20:59	0
7	66.00	48.0	24593.33	131.76	Reactive local challenge	Port Jefferybury	1	Australia	3/7/2016 1:40	1
8	74.53	30.0	68862.00	221.51	Configurable coherent function	West Colin	1	Grenada	4/18/2016 9:33	0
9	69.88	20.0	55642.32	183.82	Mandatory homogeneous architecture	Ramirezton	1	Ghana	7/11/2016 1:42	0

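Checking the Feature (Column) Domains

Clicked on Ad : the dependent variable we want to predict, i.e. whether this person clicked the ad ['0' = did not click, '1' = clicked]
Daily Time Spent on Site : how much time the user spent on this site
Age : age (NaN: missing value)
Area Income : income for that area (individual income is hard to obtain unless you have financial-sector data)
Daily Internet Usage : how much the user uses the internet per day
Ad Topic Line : description of the ad (judged unlikely to matter for predicting the dependent variable)
City : city
Male : sex ['0' = female, '1' = male]; this column has already been preprocessed once
Country : country
Timestamp : time-related data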

In [6]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       916 non-null    float64
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB

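We can see that Age has null values (missing values): only 916 of its 1000 entries are non-null. float64 is a number with a decimal point, int64 is a number without one (an integer), and object here means text.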

In [7]:

data.describe()

Out[7]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Male	Clicked on Ad
count	1000.000000	916.000000	1000.000000	1000.000000	1000.000000	1000.00000
mean	65.000200	36.128821	55000.000080	180.000100	0.481000	0.50000
std	15.853615	9.018548	13414.634022	43.902339	0.499889	0.50025
min	32.600000	19.000000	13996.500000	104.780000	0.000000	0.00000
25%	51.360000	29.000000	47031.802500	138.830000	0.000000	0.00000
50%	68.215000	35.000000	57012.300000	183.130000	0.000000	0.50000
75%	78.547500	42.000000	65470.635000	218.792500	1.000000	1.00000
max	91.430000	61.000000	79484.800000	269.960000	1.000000	1.00000

Let's try indexing a single column.

In [8]:

data['Area Income']

Out[8]:

0      61833.90
1      68441.85
2      59785.94
3      54806.18
4      73889.99
         ...   
995    71384.57
996    67782.17
997    42415.72
998    41920.79
999    29875.80
Name: Area Income, Length: 1000, dtype: float64

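Now let's visualize the column we just indexed. The warning printed below comes from seaborn: distplot is deprecated, and histplot or displot are the suggested replacements.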

In [9]:

sns.distplot(data['Area Income'])

/home/ubuntu/.local/lib/python3.6/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

Out[9]:

<AxesSubplot:xlabel='Area Income', ylabel='Density'>

In [10]:

sns.distplot(data['Age'])

/home/ubuntu/.local/lib/python3.6/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

Out[10]:

<AxesSubplot:xlabel='Age', ylabel='Density'>

Let's check how many distinct values Country has.

In [11]:

data['Country'].nunique()

Out[11]:

237

There are 237 countries.

In [12]:

data['City'].nunique()

Out[12]:

969

There are 969 cities. Out of 1000 rows, almost no city appears more than once.

In [13]:

data['Ad Topic Line'].nunique()

Out[13]:

1000

The result is 1000 out of 1000 rows: every value is unique. In this walkthrough we will drop it and move on, although this is not the only valid approach.
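The unique-value checks above can be run for all columns at once. The following is a rough sketch on a tiny made-up frame (not the real advertising data); the 0.9 cutoff for flagging a column as almost entirely unique is an arbitrary assumption for illustration.

```python
import pandas as pd

# Tiny made-up frame standing in for the text columns of the advertising data.
toy = pd.DataFrame({
    "City": ["A", "B", "C", "D"],
    "Country": ["X", "X", "Y", "Y"],
    "Ad Topic Line": ["t1", "t2", "t3", "t4"],
})

# Share of distinct values per column (1.0 means every row is unique).
unique_ratio = toy.nunique() / len(toy)

# Hypothetical cutoff: columns where nearly every value is different.
high_cardinality = unique_ratio[unique_ratio > 0.9].index.tolist()

print(unique_ratio)
print(high_cardinality)
```

A column whose values are nearly all different carries little that a simple model can generalize from, which is one reason to drop it.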

Checking for Missing Values with a Function

In [14]:

data.isna()

Out[14]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	False	True	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...
995	False	False	False	False	False	False	False	False	False	False
996	False	False	False	False	False	False	False	False	False	False
997	False	False	False	False	False	False	False	False	False	False
998	False	False	False	False	False	False	False	False	False	False
999	False	False	False	False	False	False	False	False	False	False

1000 rows × 10 columns

The result is Boolean, i.e. values in Yes/No form. It is hard to spot the missing values from this view, so let's apply another function to turn it into the information we want.

False = 0
True = 1

In [15]:

data

Out[15]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	NaN	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns

In [16]:

data.sum()

Out[16]:

Daily Time Spent on Site                                              65000.2
Age                                                                     33094
Area Income                                                           5.5e+07
Daily Internet Usage                                                   180000
Ad Topic Line               Cloned 5thgeneration orchestrationMonitored na...
City                        WrightburghWest JodiDavidtonWest TerrifurtSout...
Male                                                                      481
Country                     TunisiaNauruSan MarinoItalyIcelandNorwayMyanma...
Timestamp                   3/27/2016 0:534/4/2016 1:393/13/2016 20:351/10...
Clicked on Ad                                                             500
dtype: object

The .sum() function returns the sum of each column (for text columns, the strings are simply concatenated). Let's apply this by appending .sum() to data.isna().

In [17]:

data.isna().sum()

Out[17]:

Daily Time Spent on Site     0
Age                         84
Area Income                  0
Daily Internet Usage         0
Ad Topic Line                0
City                         0
Male                         0
Country                      0
Timestamp                    0
Clicked on Ad                0
dtype: int64

Since False = 0 and True = 1, adding them up gives the result above.

As a result, we can see that Age has 84 missing values.

In [18]:

len(data)

Out[18]:

1000

In [19]:

data.isna().sum() / len(data)

Out[19]:

Daily Time Spent on Site    0.000
Age                         0.084
Area Income                 0.000
Daily Internet Usage        0.000
Ad Topic Line               0.000
City                        0.000
Male                        0.000
Country                     0.000
Timestamp                   0.000
Clicked on Ad               0.000
dtype: float64

Dividing by the total of 1000 rows gives the fraction of missing values in each column (0.084 means 8.4%).
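The same ratio can also be obtained in one step, because the mean of a Boolean column is the share of True values. This is a small sketch on a made-up frame, not the real data.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing ages out of four rows.
toy = pd.DataFrame({
    "Age": [31.0, np.nan, 26.0, np.nan],
    "Male": [0, 1, 0, 1],
})

missing_count = toy.isna().sum()    # True counts as 1, False as 0
missing_ratio = toy.isna().mean()   # same as isna().sum() / len(toy)

print(missing_count)
print(missing_ratio)
```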

Ways to Handle Missing Values (How to Impute)

Remove the rows or columns
Fill the values in
Keep them as they are

1. Removing

Removing rows : .dropna()

In [20]:

data.dropna()

Out[20]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
5	59.99	23.0	59761.56	226.74	Sharable client-driven software	Jamieberg	1	Norway	5/19/2016 14:30	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

916 rows × 10 columns

In [21]:

data

Out[21]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	NaN	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns

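Because we did not update data, calling data still shows 1000 rows. The way to update it is shown below.

data.dropna(inplace=True) We are not going to use the row-removal approach here, so we will not apply this update. Removing rows can also throw away other important information in those rows.

Removing a column (feature) : .drop()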

data.drop('Age', axis=1)

In [22]:

data

Out[22]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	NaN	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns
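Both removal methods above return a new DataFrame and leave the original untouched unless the result is kept. The sketch below shows this on a made-up three-row frame; the names toy and cleaned are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing age.
toy = pd.DataFrame({
    "Age": [31.0, np.nan, 26.0],
    "Male": [1, 0, 0],
})

rows_dropped = toy.dropna()              # new frame without rows containing NaN
cols_dropped = toy.drop("Age", axis=1)   # new frame without the Age column

print(toy.shape)           # the original is unchanged
print(rows_dropped.shape)
print(cols_dropped.columns.tolist())

# To keep a result, assign it to a name (or reassign it to the same name).
# subset limits the NaN check to the listed columns.
cleaned = toy.dropna(subset=["Age"])
print(cleaned.shape)
```

Reassigning the result is an alternative to inplace=True and makes it easier to see where the data changes.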

Filling in : mean

In [23]:

data['Age'].mean()

Out[23]:

36.12882096069869

In [24]:

data['Age'].median()

Out[24]:

35.0

In [25]:

data.fillna(data['Age'].mean())

Out[25]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	36.128821	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.000000	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.000000	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.000000	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.000000	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.000000	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.000000	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.000000	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.000000	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.000000	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns

The decimal places make the ages look untidy. Let's remove them using round().

In [26]:

round(3.3)  # round() example

Out[26]:

3

In [27]:

round(data['Age'].mean())

Out[27]:

36

In [28]:

data.fillna(round(data['Age'].mean()))

Out[28]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	36.0	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns
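A side note on the round() call used above: in Python 3, exact ties are rounded to the nearest even integer rather than always upward. It makes no difference for a mean of about 36.13, but it can surprise with values ending in .5. The helper round_half_up below is a hypothetical name, written only for this illustration.

```python
import math

print(round(3.3))   # 3
print(round(2.5))   # 2: ties go to the nearest even integer
print(round(3.5))   # 4

def round_half_up(x):
    """Round a non-negative number with ties going up (illustrative helper)."""
    return math.floor(x + 0.5)

print(round_half_up(2.5))  # 3
```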

In [29]:

data

Out[29]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	NaN	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns

Calling data again shows that the missing values were not filled in by data.fillna(round(data['Age'].mean())) above; they still appear as NaN. Let's update data.

In [30]:

data = data.fillna(round(data['Age'].mean()))

In [31]:

data

Out[31]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	36.0	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns

The missing values are now filled with the mean, and the update is done. Let's use data.isna().sum() to confirm that every missing Age value really has been filled.

In [32]:

data.isna().sum()

Out[32]:

Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               0
dtype: int64

Every count is 0. All missing values have been filled.
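One caveat: data.fillna(value) writes the same value into the missing cells of every column. That is harmless here because only Age has missing values, but filling the column directly is safer in general. The sketch below uses a made-up frame in which two columns have gaps.

```python
import numpy as np
import pandas as pd

# Made-up frame: both columns have one missing value.
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0],
    "Area Income": [50000.0, 60000.0, np.nan],
})

age_fill = round(toy["Age"].mean())   # mean of 30 and 40, rounded

# Frame-wide fill: the age value also lands in Area Income.
filled_all = toy.fillna(age_fill)

# Column-specific fill: only Age is touched.
filled_age = toy.copy()
filled_age["Age"] = filled_age["Age"].fillna(age_fill)

print(filled_all)
print(filled_age)
```

The median (data['Age'].median(), computed earlier) could be used the same way when a column has outliers.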

In [33]:

from sklearn.model_selection import train_test_split

In [36]:

X = data[['Daily Time Spent on Site','Age','Area Income','Daily Internet Usage','Male']]  # independent variables
y = data['Clicked on Ad']      # the dependent variable we want to predict

In [42]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)   # train:test ratio of 8:2

In [43]:

X_train

Out[43]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Male
675	82.58	38.0	65496.78	225.23	1
358	51.38	59.0	42362.49	158.56	0
159	75.55	36.0	73234.87	159.24	0
533	91.43	36.0	46964.11	209.91	1
678	87.85	34.0	51816.27	153.01	0
...	...	...	...	...	...
855	50.87	24.0	62939.50	190.41	0
871	76.79	27.0	55677.12	235.94	0
835	63.11	34.0	63107.88	254.94	1
792	56.56	26.0	68783.45	204.47	1
520	46.61	42.0	65856.74	136.18	0

800 rows × 5 columns

In [44]:

X_test

Out[44]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Male
249	62.20	25.0	25408.21	161.16	0
353	79.54	44.0	70492.60	217.68	1
537	61.72	26.0	67279.06	218.49	0
424	43.59	36.0	58849.77	132.31	1
564	64.75	36.0	63001.03	117.66	0
...	...	...	...	...	...
684	42.06	34.0	43241.19	131.55	0
644	78.35	46.0	53185.34	253.48	0
110	66.63	60.0	60333.38	176.98	0
28	70.20	34.0	32708.94	119.20	0
804	53.92	41.0	25739.09	125.46	1

200 rows × 5 columns
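As a sketch of what the split does, the example below runs train_test_split on made-up balanced data. The stratify argument is an addition not used in the cell above; it is assumed here only to show how the class ratio can be kept identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up balanced data: 50 rows, 2 features, 25 samples per class.
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 25 + [1] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=100,   # fixed seed so the split is reproducible
    stratify=y_toy,     # keep the 0/1 ratio the same in train and test
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With a 50:50 target like Clicked on Ad, a plain random split is usually close to balanced anyway; stratifying matters more when one class is rare.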

Now that the train and test data are ready, let's move on to the modeling part.

In [46]:

from sklearn.linear_model import LogisticRegression

In [48]:

model = LogisticRegression()  # create a LogisticRegression instance and name it model

In [49]:

model.fit(X_train, y_train)

Out[49]:

LogisticRegression()

In [50]:

model.coef_

Out[50]:

array([[-6.64737762e-02,  2.66015818e-01, -1.15501902e-05,
        -2.44285539e-02,  2.00758165e-03]])
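One way to read these coefficients: each one is the change in the log-odds of a click for a one-unit increase in that feature, with the others held fixed, so exponentiating it gives the multiplicative change in the odds. The sketch below copies the values from the output above and assumes they follow the column order of X.

```python
import math

# Column order of X, assumed to match the order of model.coef_.
features = [
    "Daily Time Spent on Site", "Age", "Area Income",
    "Daily Internet Usage", "Male",
]
# Values copied from the model.coef_ output above.
coefs = [-6.64737762e-02, 2.66015818e-01, -1.15501902e-05,
         -2.44285539e-02, 2.00758165e-03]

# exp(coefficient) = factor by which the odds change per one-unit increase.
odds_ratios = {name: math.exp(c) for name, c in zip(features, coefs)}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Because the features are on very different scales (income in the tens of thousands, Male as 0 or 1), the raw magnitudes are not directly comparable across features.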

In [58]:

pred = model.predict(X_test)  # pass in X_test, the data set aside for evaluation

In [59]:

pred

Out[59]:

array([0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1])

In [55]:

y_test

Out[55]:

249    1
353    0
537    0
424    1
564    1
      ..
684    1
644    0
110    1
28     1
804    1
Name: Clicked on Ad, Length: 200, dtype: int64

We could compare pred and y_test by eye to see how many predictions are right and wrong, but there is a function that does this for us as well.

In [56]:

from sklearn.metrics import accuracy_score, confusion_matrix

In [61]:

accuracy_score(y_test, pred)

Out[61]:

0.9

The accuracy is 0.9: 90% of the 200 test samples were classified correctly, which is a good score.

In [62]:

confusion_matrix(y_test, pred)

Out[62]:

array([[92,  8],
       [12, 88]])
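The confusion matrix can be unpacked by hand to see where the 0.9 comes from. scikit-learn puts the true labels on the rows and the predicted labels on the columns, so for labels 0 and 1 the layout is [[TN, FP], [FN, TP]]. The sketch below uses the numbers from the output above.

```python
# Values copied from the confusion_matrix output above.
cm = [[92, 8],
      [12, 88]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of predicted clicks, how many were real clicks
recall = tp / (tp + fn)                      # of real clicks, how many were found

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

So the model wrongly predicted a click 8 times and missed 12 actual clicks.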

Python Skill Tip

Working with columns (features)

In [63]:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [65]:

data = pd.read_csv('./data/advertising.csv')

In [66]:

data

Out[66]:

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	68.95	NaN	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53	0
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39	0
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35	0
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31	0
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36	0
...	...	...	...	...	...	...	...	...	...	...
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

1000 rows × 10 columns

In [67]:

data['Country'].nunique()

Out[67]:

237

In [69]:

data['Country'].unique()

Out[69]:

array(['Tunisia', 'Nauru', 'San Marino', 'Italy', 'Iceland', 'Norway',
       'Myanmar', 'Australia', 'Grenada', 'Ghana', 'Qatar', 'Burundi',
       'Egypt', 'Bosnia and Herzegovina', 'Barbados', 'Spain',
       'Palestinian Territory', 'Afghanistan',
       'British Indian Ocean Territory (Chagos Archipelago)',
       'Russian Federation', 'Cameroon', 'Korea', 'Tokelau', 'Monaco',
       'Tuvalu', 'Greece', 'British Virgin Islands',
       'Bouvet Island (Bouvetoya)', 'Peru', 'Aruba', 'Maldives',
       'Senegal', 'Dominica', 'Luxembourg', 'Montenegro', 'Ukraine',
       'Saint Helena', 'Liberia', 'Turkmenistan', 'Niger', 'Sri Lanka',
       'Trinidad and Tobago', 'United Kingdom', 'Guinea-Bissau',
       'Micronesia', 'Turkey', 'Croatia', 'Israel',
       'Svalbard & Jan Mayen Islands', 'Azerbaijan', 'Iran',
       'Saint Vincent and the Grenadines', 'Bulgaria', 'Christmas Island',
       'Canada', 'Rwanda', 'Turks and Caicos Islands', 'Norfolk Island',
       'Cook Islands', 'Guatemala', "Cote d'Ivoire", 'Faroe Islands',
       'Ireland', 'Moldova', 'Nicaragua', 'Montserrat', 'Timor-Leste',
       'Puerto Rico', 'Central African Republic', 'Venezuela',
       'Wallis and Futuna', 'Jersey', 'Samoa',
       'Antarctica (the territory South of 60 deg S)', 'Albania',
       'Hong Kong', 'Lithuania', 'Bangladesh', 'Western Sahara', 'Serbia',
       'Czech Republic', 'Guernsey', 'Tanzania', 'Bhutan', 'Guinea',
       'Madagascar', 'Lebanon', 'Eritrea', 'Guyana',
       'United Arab Emirates', 'Martinique', 'Somalia', 'Benin',
       'Papua New Guinea', 'Uzbekistan', 'South Africa', 'Hungary',
       'Falkland Islands (Malvinas)', 'Saint Martin', 'Cuba',
       'United States Minor Outlying Islands', 'Belize', 'Kuwait',
       'Thailand', 'Gibraltar', 'Holy See (Vatican City State)',
       'Netherlands', 'Belarus', 'New Zealand', 'Togo', 'Kenya', 'Palau',
       'Cambodia', 'Costa Rica', 'Liechtenstein', 'Angola',
       'Equatorial Guinea', 'Mongolia', 'Brazil', 'Chad', 'Portugal',
       'Malawi', 'Singapore', 'Kazakhstan', 'China', 'Vietnam', 'Mayotte',
       'Jamaica', 'Bahamas', 'Algeria', 'Fiji', 'Argentina',
       'Philippines', 'Suriname', 'Guam', 'Antigua and Barbuda',
       'Georgia', 'Jordan', 'Saudi Arabia', 'Sao Tome and Principe',
       'Cyprus', 'Kyrgyz Republic', 'Pakistan', 'Seychelles',
       'Mauritania', 'Chile', 'Poland', 'Estonia', 'Latvia', 'Bahrain',
       'Colombia', 'Brunei Darussalam', 'Taiwan',
       'Saint Pierre and Miquelon', 'Finland',
       'French Southern Territories', 'Sierra Leone', 'Tajikistan',
       'Ecuador', 'Switzerland', 'France', 'Malaysia', 'Mauritius',
       'Japan', 'Greenland', 'Guadeloupe', 'Belgium', 'Honduras',
       'Paraguay', 'French Guiana', 'Northern Mariana Islands',
       'American Samoa', 'Austria', 'Tonga', 'New Caledonia',
       'United States of America', 'Morocco', 'Macedonia', 'Gabon',
       'Uganda', 'Saint Lucia', 'Niue', 'Zambia', 'Congo',
       'Pitcairn Islands', 'Anguilla', 'Sweden', 'Indonesia', 'Mexico',
       'Haiti', 'Gambia', 'El Salvador', 'Libyan Arab Jamahiriya',
       'Saint Barthelemy', 'Reunion', 'Panama', 'Dominican Republic',
       'Zimbabwe', 'Swaziland', 'Saint Kitts and Nevis', 'Burkina Faso',
       'Heard Island and McDonald Islands', 'Bolivia',
       'Netherlands Antilles', 'French Polynesia', 'Germany', 'Malta',
       'Sudan', "Lao People's Democratic Republic", 'Isle of Man',
       'Macao', 'United States Virgin Islands', 'Djibouti', 'Mali',
       'Romania', 'Cayman Islands', 'Ethiopia', 'Uruguay', 'Comoros',
       'Vanuatu', 'Nepal', 'Yemen', 'India', 'Cape Verde', 'Slovenia',
       'Denmark', 'Syrian Arab Republic', 'Andorra', 'Namibia',
       'Slovakia (Slovak Republic)', 'Armenia',
       'South Georgia and the South Sandwich Islands', 'Kiribati',
       'Marshall Islands', 'Bermuda', 'Mozambique', 'Lesotho'],
      dtype=object)

value_counts() shows how many times each value appears, so repeated items are easy to spot.

In [72]:

data['Country'].value_counts()

Out[72]:

France                   9
Czech Republic           9
Australia                8
Turkey                   8
South Africa             8
                        ..
Aruba                    1
Saint Kitts and Nevis    1
Cape Verde               1
Montserrat               1
Bermuda                  1
Name: Country, Length: 237, dtype: int64

In [70]:

data['Country'].value_counts().head(30)  

Out[70]:

France                    9
Czech Republic            9
Australia                 8
Turkey                    8
South Africa              8
Micronesia                8
Afghanistan               8
Cyprus                    8
Liberia                   8
Peru                      8
Senegal                   8
Greece                    8
Taiwan                    7
Bahamas                   7
Burundi                   7
Ethiopia                  7
Eritrea                   7
Cambodia                  7
Albania                   7
Venezuela                 7
Western Sahara            7
Fiji                      7
Luxembourg                7
Bosnia and Herzegovina    7
Zimbabwe                  6
Mongolia                  6
Hungary                   6
Belarus                   6
Algeria                   6
Qatar                     6
Name: Country, dtype: int64

In [71]:

data['Country'].value_counts().tail(30)  

Out[71]:

Comoros                                                2
Reunion                                                2
Pitcairn Islands                                       2
Sao Tome and Principe                                  2
Andorra                                                2
Djibouti                                               2
South Georgia and the South Sandwich Islands           2
Mauritania                                             2
Slovakia (Slovak Republic)                             2
Norway                                                 2
Bhutan                                                 2
Benin                                                  2
Central African Republic                               2
Uzbekistan                                             2
Haiti                                                  2
Guinea-Bissau                                          2
Lesotho                                                1
Slovenia                                               1
Mozambique                                             1
Romania                                                1
Kiribati                                               1
Germany                                                1
Marshall Islands                                       1
Jordan                                                 1
British Indian Ocean Territory (Chagos Archipelago)    1
Aruba                                                  1
Saint Kitts and Nevis                                  1
Cape Verde                                             1
Montserrat                                             1
Bermuda                                                1
Name: Country, dtype: int64
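value_counts() has a few options that go well with head() and tail(). The sketch below uses a made-up Series of six country names; the cutoff of "more than once" is only an example.

```python
import pandas as pd

# Made-up values standing in for the Country column.
countries = pd.Series(["France", "France", "France", "Peru", "Peru", "Aruba"])

counts = countries.value_counts()                 # occurrences, most frequent first
shares = countries.value_counts(normalize=True)   # same, as fractions of all rows

# Keep only the values that appear more than once.
repeated = counts[counts > 1]

print(counts)
print(shares)
print(repeated.index.tolist())
```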

Source: fast campus, "E-commerce Data Analysis with Python" (파이썬을 활용한 이커머스 데이터 분석)
