Chapter 2. Getting Started with pandas¶

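pandas is a Python library that provides two data types, DataFrame and Series, along with a wide range of features for data analysis. If you know Python, you can start analyzing data with pandas right away, and you can turn repetitive analysis tasks into programs so they are easy to repeat. This chapter summarizes the basic concepts of pandas and walks through a few simple exercises to see how it works. The contents are as follows.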

2-1 Loading a Data Set
2-2 Extracting Data
2-3 Calculating Basic Statistics
2-4 Drawing Graphs

2-1 Loading a Data Set¶

Data analysis starts with loading the data¶

What is the first thing to do in a data analysis?

Loading the Gapminder data set¶

1.¶

To use pandas features, you first need to import the pandas library. Enter the following to import it.

In [4]:

import pandas

2.¶

To load the Gapminder data set, use the read_csv function. By default, read_csv reads data whose columns are separated by commas (,). The Gapminder file, however, separates columns with tabs, so you have to tell read_csv this when you call it. Pass '\t' as the value of the sep parameter.

In [5]:

df = pandas.read_csv('./data/gapminder.tsv', sep = '\t')
df

Out[5]:

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106
...	...	...	...	...	...	...
1699	Zimbabwe	Africa	1987	62.351	9216418	706.157306
1700	Zimbabwe	Africa	1992	60.377	10704340	693.420786
1701	Zimbabwe	Africa	1997	46.809	11404948	792.449960
1702	Zimbabwe	Africa	2002	39.989	11926563	672.038623
1703	Zimbabwe	Africa	2007	43.487	12311143	469.709298

1704 rows × 6 columns
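To see what the sep argument changes without needing the Gapminder file, the sketch below reads a small tab-separated string through io.StringIO. The sample text and the variable names are made up for illustration and are not part of the Gapminder data set.

```python
import io

import pandas as pd

# Hypothetical tab-separated text standing in for a small .tsv file
tsv_text = "country\tyear\tpop\nKorea\t2000\t47\nJapan\t2000\t127\n"

# With sep='\t', each tab starts a new column
df_tab = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(df_tab.shape)

# With the default separator (a comma), each whole line lands in one column
df_comma = pd.read_csv(io.StringIO(tsv_text))
print(df_comma.shape)
```

Comparing the two shapes is a quick way to notice that a file was read with the wrong separator.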

3.¶

To call a pandas function, you write pandas followed by the dot (.) operator. Typing pandas every time is tedious, so by convention pandas is abbreviated to pd. Importing it as follows lets you write pd instead of pandas. The rest of this chapter uses this form.

In [6]:

import pandas as pd
df = pd.read_csv('./data/gapminder.tsv', sep = '\t')
df

Out[6]:

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106
...	...	...	...	...	...	...
1699	Zimbabwe	Africa	1987	62.351	9216418	706.157306
1700	Zimbabwe	Africa	1992	60.377	10704340	693.420786
1701	Zimbabwe	Africa	1997	46.809	11404948	792.449960
1702	Zimbabwe	Africa	2002	39.989	11926563	672.038623
1703	Zimbabwe	Africa	2007	43.487	12311143	469.709298

1704 rows × 6 columns

Series and DataFrame¶

Did the Gapminder data set load correctly? Next, let's look at the data types pandas uses. To handle data efficiently, pandas uses two data types called Series and DataFrame. A DataFrame corresponds to a sheet in Excel, and a Series corresponds to a single column of that sheet. In Python terms, you can think of a DataFrame as a dictionary whose values are Series.
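The dictionary analogy can be tried directly. The following is a minimal sketch with made-up values (not Gapminder data) that builds a DataFrame from a dictionary of Series and then pulls one Series back out by its key.

```python
import pandas as pd

# Two Series that share the same default index (sample values only)
year = pd.Series([1952, 1957, 1962])
life_exp = pd.Series([28.8, 30.3, 32.0])

# A DataFrame can be built from a dictionary whose values are Series
small_df = pd.DataFrame({'year': year, 'lifeExp': life_exp})

print(type(small_df))
print(type(small_df['year']))  # selecting one key gives back a Series
```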

1.¶

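Let's check that the value stored in df really is a DataFrame. The output shows that it is a pandas DataFrame. The built-in type function returns the type of an object, which print then displays. You will use it often, so it is worth remembering.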

In [7]:

print(type(df))

<class 'pandas.core.frame.DataFrame'>

2.¶

A DataFrame stores the number of rows and columns it holds in an attribute called shape. Running the following shows the size of the Gapminder data. The first value is the number of rows and the second is the number of columns.

In [8]:

print(df.shape)

(1704, 6)

3.¶

Next, let's see what information Gapminder contains, starting with the columns. Just as step 2 used the shape attribute, the columns attribute gives the column names of a DataFrame. The Gapminder columns are country, continent, year, lifeExp, pop, and gdpPercap.

In [9]:

print(df.columns)

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

4.¶

The data type of each column in a DataFrame can be checked easily with the dtypes attribute or the info method.

In [10]:

print(df.dtypes)

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

In [11]:

# info() prints its report itself and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

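Comparing pandas and Python data types¶

pandas and Python use different names for the same kind of data. In the output above, for example, text columns that hold Python strings are reported as object, integers as int64, and floating-point numbers as float64.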
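As a small illustration of this naming difference, the sketch below builds a DataFrame from ordinary Python values and prints the dtypes pandas assigns. The column names and values are invented for the example, and the label shown for text columns may differ between pandas versions (it is object in the output shown in this post).

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['a', 'b'],      # Python str
    'count': [1, 2],         # Python int
    'ratio': [0.5, 1.5],     # Python float
})

print(sample.dtypes)

# A single value pulled out of the text column is still a Python str
print(type(sample.loc[0, 'name']))
```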

2-2 Extracting Data¶

Extracting data by column¶

1.¶

The following extracts the column named country from the DataFrame (df) and stores it in country_df. The type function shows that the data stored in country_df is a Series. A Series also has head and tail methods, which by default print the first or last five values.

In [12]:

country_df = df['country']
print(type(country_df))

<class 'pandas.core.series.Series'>

In [13]:

print(country_df.head())

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

In [14]:

print(country_df.tail())

1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object

2.¶

Passing a list of column names extracts several columns at once. Because the selection is made with a list rather than a single name, the result is a DataFrame, not a Series.

In [15]:

subset = df[['country','continent','year']]
print(type(subset))

<class 'pandas.core.frame.DataFrame'>

In [16]:

print(subset.head())

       country continent  year
0  Afghanistan      Asia  1952
1  Afghanistan      Asia  1957
2  Afghanistan      Asia  1962
3  Afghanistan      Asia  1967
4  Afghanistan      Asia  1972

In [17]:

print(subset.tail())

       country continent  year
1699  Zimbabwe    Africa  1987
1700  Zimbabwe    Africa  1992
1701  Zimbabwe    Africa  1997
1702  Zimbabwe    Africa  2002
1703  Zimbabwe    Africa  2007
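The difference between the two bracket forms shows up even with a single column. The sketch below uses a tiny made-up DataFrame; a plain string key returns a Series, while a list containing that one name returns a one-column DataFrame.

```python
import pandas as pd

sample = pd.DataFrame({'country': ['A', 'B'], 'year': [1952, 1957]})

one_col_series = sample['country']    # string key -> Series
one_col_frame = sample[['country']]   # list with one name -> DataFrame

print(type(one_col_series))
print(type(one_col_frame))
print(one_col_frame.shape)
```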

Extracting data by row¶

To extract data by row, use the loc or iloc attribute.

Understanding the index and row numbers¶

Extracting row data with the loc attribute¶

In [18]:

print(df.loc[0])

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap     779.445314
Name: 0, dtype: object

In [19]:

print(df.loc[99])

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap    721.186086
Name: 99, dtype: object

In [20]:

print(df.loc[2])

country      Afghanistan
continent           Asia
year                1962
lifeExp           31.997
pop             10267083
gdpPercap      853.10071
Name: 2, dtype: object

2.¶

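How do you extract the last row of a DataFrame? You need to find the index of the last row. The number of rows is shape[0], and because the index starts at 0, the last index is one less than that.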

In [21]:

number_of_rows = df.shape[0]
last_row_index = number_of_rows - 1
print(df.loc[last_row_index])

country        Zimbabwe
continent        Africa
year               2007
lifeExp          43.487
pop            12311143
gdpPercap    469.709298
Name: 1703, dtype: object

3.¶

Another way is to use the tail method with n=1, which returns the last row as a one-row DataFrame.

In [22]:

print(df.tail(n=1))

       country continent  year  lifeExp       pop   gdpPercap
1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298

4.¶

To extract the rows with index 0, 99, and 999 at once, put the indexes in a list and pass it to the loc attribute.

In [23]:

print(df.loc[[0,99,999]])

         country continent  year  lifeExp       pop    gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333   779.445314
99    Bangladesh      Asia  1967   43.453  62821884   721.186086
999     Mongolia      Asia  1967   51.253   1149500  1226.041130

Extracting row data with the iloc attribute¶

1.¶

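Now let's extract row data with the iloc attribute. The loc attribute extracts data using the DataFrame's index, whereas iloc uses the row number, meaning the position of the row in the data. Here the index and the row numbers happen to be identical, so both give the same results. The following passes 1 to iloc to extract a row.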

In [25]:

print(df.iloc[1])

country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap      820.85303
Name: 1, dtype: object

In [26]:

print(df.iloc[99])

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap    721.186086
Name: 99, dtype: object
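The distinction between index and row number only becomes visible when the two differ. The following sketch uses a made-up DataFrame whose index labels are 10, 20, and 30, so that loc (labels) and iloc (positions) no longer accept the same numbers.

```python
import pandas as pd

sample = pd.DataFrame(
    {'country': ['A', 'B', 'C']},
    index=[10, 20, 30],   # index labels that differ from row positions
)

print(sample.loc[10])    # label 10 -> first row
print(sample.iloc[0])    # position 0 -> the same first row

# Label 0 does not exist in this index, so loc raises KeyError
try:
    sample.loc[0]
except KeyError as err:
    print('KeyError:', err)
```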

2.¶

The iloc attribute also accepts negative numbers. The following passes -1 to extract the last row. Passing a row number that does not exist in the DataFrame, however, raises an error.

In [27]:

print(df.iloc[-1])

country        Zimbabwe
continent        Africa
year               2007
lifeExp          43.487
pop            12311143
gdpPercap    469.709298
Name: 1703, dtype: object

In [28]:

print(df.iloc[1710])

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-28-e91d411323c7> in <module>
----> 1 print(df.iloc[1710])

c:\users\금동훈\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    893 
    894             maybe_callable = com.apply_if_callable(key, self.obj)
--> 895             return self._getitem_axis(maybe_callable, axis=axis)
    896 
    897     def _is_scalar_access(self, key: Tuple):

c:\users\금동훈\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1499 
   1500             # validate the location
-> 1501             self._validate_integer(key, axis)
   1502 
   1503             return self.obj._ixs(key, axis=axis)

c:\users\금동훈\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexing.py in _validate_integer(self, key, axis)
   1442         len_axis = len(self.obj._get_axis(axis))
   1443         if key >= len_axis or key < -len_axis:
-> 1444             raise IndexError("single positional indexer is out-of-bounds")
   1445 
   1446     # -------------------------------------------------------------------

IndexError: single positional indexer is out-of-bounds
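When the row number comes from user input or a calculation, the error can be caught instead of stopping the program. The following is one possible sketch; safe_row is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

sample = pd.DataFrame({'x': [1, 2, 3]})


def safe_row(frame, position):
    """Return the row at `position`, or None when it is out of bounds."""
    try:
        return frame.iloc[position]
    except IndexError:
        return None


print(safe_row(sample, -1))   # last row
print(safe_row(sample, 10))   # no such row, so None
```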

3.¶

iloc can also extract several rows at once. As with loc, put the row numbers you want in a list and pass it in.

In [29]:

print(df.iloc[[0, 99, 999]])

         country continent  year  lifeExp       pop    gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333   779.445314
99    Bangladesh      Asia  1967   43.453  62821884   721.186086
999     Mongolia      Asia  1967   51.253   1149500  1226.041130

Using loc and iloc freely¶

To use loc and iloc more freely, you need to know how to specify both the rows and the columns to extract. Both attributes take the rows first and then the columns, so the code has the form df.loc[[rows], [columns]] or df.iloc[[rows], [columns]]. Rows and columns can be specified with slicing syntax or with the range function.

Extracting data - slicing syntax and the range function¶

1. Extracting data with slicing syntax¶

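The following extracts the year and pop columns for all rows (:). The column selector passed to loc and iloc must match what each attribute expects: loc takes column names and iloc takes integer positions. For example, passing a list of integers as the column selector of loc raises an error here, because the column names are strings.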

In [30]:

subset = df.loc[:,['year','pop']]
print(subset.head())

   year       pop
0  1952   8425333
1  1957   9240934
2  1962  10267083
3  1967  11537966
4  1972  13079460

In [31]:

subset = df.iloc[:,[2,3,-1]]
print(subset.head())

   year  lifeExp   gdpPercap
0  1952   28.801  779.445314
1  1957   30.332  820.853030
2  1962   31.997  853.100710
3  1967   34.020  836.197138
4  1972   36.088  739.981106
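The error mentioned for loc can be reproduced safely on a small example. The sketch below uses a one-row made-up DataFrame with string column names; iloc accepts the integer list, while loc looks for columns literally named 0 and 1 and raises KeyError.

```python
import pandas as pd

sample = pd.DataFrame({'year': [1952], 'pop': [8425333]})

# iloc accepts integer positions for the columns
print(sample.iloc[:, [0, 1]])

# loc expects column labels; the integers 0 and 1 are not column names here
try:
    sample.loc[:, [0, 1]]
    loc_failed = False
except KeyError:
    loc_failed = True

print(loc_failed)
```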

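2. Extracting data with the range function¶

Now let's combine iloc with Python's built-in range function. range produces the integers in a given interval. Since the column selector of iloc takes a list of integers, and range can supply one, the two can be used together to extract the data you want. Strictly speaking, range does not return a list of integers but a lazy range object, which is why the code below converts it with list.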

In [32]:

small_range = list(range(5))
print(small_range)

[0, 1, 2, 3, 4]

In [33]:

print(type(small_range))

<class 'list'>
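The type check above is run on the converted list, so it does not show what range itself returns. As a short supplementary sketch, the following inspects the object before conversion; in Python 3 it is a range object that produces its integers on demand.

```python
numbers = range(5)

print(type(numbers))              # a range object, not a list
print(list(numbers))              # converting it produces the integers
print(numbers[2], len(numbers))   # it still supports indexing and len()
```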

In [34]:

subset = df.iloc[:,small_range]
print(subset.head())

       country continent  year  lifeExp       pop
0  Afghanistan      Asia  1952   28.801   8425333
1  Afghanistan      Asia  1957   30.332   9240934
2  Afghanistan      Asia  1962   31.997  10267083
3  Afghanistan      Asia  1967   34.020  11537966
4  Afghanistan      Asia  1972   36.088  13079460

In [35]:

small_range = list(range(3,6))
print(small_range)

[3, 4, 5]

In [36]:

subset = df.iloc[:,small_range]
print(subset.head())

   lifeExp       pop   gdpPercap
0   28.801   8425333  779.445314
1   30.332   9240934  820.853030
2   31.997  10267083  853.100710
3   34.020  11537966  836.197138
4   36.088  13079460  739.981106

3.¶

What happens if you pass three arguments, as in range(0, 6, 2)? It produces a range from 0 up to 5 in steps of 2. Converting it to a list gives the even integers in the range 0 to 5: [0, 2, 4].

In [37]:

small_range = list(range(0,6,2))
subset = df.iloc[:,small_range]
print(subset.head())

       country  year       pop
0  Afghanistan  1952   8425333
1  Afghanistan  1957   9240934
2  Afghanistan  1962  10267083
3  Afghanistan  1967  11537966
4  Afghanistan  1972  13079460

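4. Comparing slicing syntax and the range function¶

In practice, Python slicing syntax is preferred over range because it is more convenient: there is no need to convert a range object into a list. For example, using list(range(3)) and :3 as the column selector gives the same result.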

In [38]:

subset = df.iloc[:,:3]
print(subset.head())

       country continent  year
0  Afghanistan      Asia  1952
1  Afghanistan      Asia  1957
2  Afghanistan      Asia  1962
3  Afghanistan      Asia  1967
4  Afghanistan      Asia  1972

5.¶

Passing 0:6:2 as the column selector gives the same result as step 3.

In [40]:

subset = df.iloc[:,0:6:2]
print(subset.head())

       country  year       pop
0  Afghanistan  1952   8425333
1  Afghanistan  1957   9240934
2  Afghanistan  1962  10267083
3  Afghanistan  1967  11537966
4  Afghanistan  1972  13079460
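The equivalence of the two forms can also be checked in code rather than by eye. The following sketch uses a one-row made-up DataFrame with six columns and compares the slice with the list built from range.

```python
import pandas as pd

sample = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('abcdef'))

by_slice = sample.iloc[:, 0:6:2]
by_range = sample.iloc[:, list(range(0, 6, 2))]

print(list(by_slice.columns))
print(by_slice.equals(by_range))
```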

6. Using loc and iloc freely¶

To extract columns 0, 3, and 5 of rows 0, 99, and 999 with iloc, write the code as follows.

In [41]:

print(df.iloc[[0,99,999],[0,3,5]])

         country  lifeExp    gdpPercap
0    Afghanistan   28.801   779.445314
99    Bangladesh   43.453   721.186086
999     Mongolia   51.253  1226.041130

7.¶

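Passing a list of integers as the column selector of iloc may look convenient, but later it can be hard to tell which data the code was meant to extract. For that reason, it is more common to use loc and pass column names as the column selector, as follows.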

In [44]:

print(df.loc[[0, 99, 999], ['country','lifeExp','gdpPercap']])

         country  lifeExp    gdpPercap
0    Afghanistan   28.801   779.445314
99    Bangladesh   43.453   721.186086
999     Mongolia   51.253  1226.041130

8.¶

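Let's put everything covered so far together. The following code extracts the country, lifeExp, and gdpPercap columns for the rows with index 10 through 13. Note that a loc slice includes the end label, so row 13 is part of the result.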

In [46]:

print(df.loc[10:13,['country','lifeExp','gdpPercap']])

        country  lifeExp    gdpPercap
10  Afghanistan   42.129   726.734055
11  Afghanistan   43.828   974.580338
12      Albania   55.230  1601.056136
13      Albania   59.280  1942.284244
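One detail worth checking here is how the end of a slice is treated. In the sketch below, which uses a made-up 20-row DataFrame, the same 10:13 slice returns four rows with loc (the end label is included) and three rows with iloc (the end position is excluded, as in ordinary Python slicing).

```python
import pandas as pd

sample = pd.DataFrame({'value': range(20)})

by_label = sample.loc[10:13]       # labels 10, 11, 12, 13
by_position = sample.iloc[10:13]   # positions 10, 11, 12

print(len(by_label), len(by_position))
```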

2-3 Calculating Basic Statistics¶

In [47]:

print(df.head(n=10))

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
5  Afghanistan      Asia  1977   38.438  14880372  786.113360
6  Afghanistan      Asia  1982   39.854  12881816  978.011439
7  Afghanistan      Asia  1987   40.822  13867957  852.395945
8  Afghanistan      Asia  1992   41.674  16317921  649.341395
9  Afghanistan      Asia  1997   41.763  22227415  635.341351

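Computing the mean of grouped data¶

1. Grouping the lifeExp column by year and computing the mean¶

How would you compute the mean of the lifeExp column for each year? Group the data by the year column and then take the mean of the lifeExp column.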

In [50]:

print(df.groupby('year')['lifeExp'].mean())

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

2.¶

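Let's break the code from step 1 into smaller pieces. First, look at the result of grouping the DataFrame by year. Passing the column name year to the groupby method returns a DataFrameGroupBy object that holds the country, continent, ..., gdpPercap columns grouped by year.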

In [51]:

grouped_year_df =  df.groupby('year')
print(type(grouped_year_df))

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

3.¶

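Printing grouped_year_df shows the type of the object from step 2 and the memory address where it is stored. The output shows that the data grouped by year is held in a DataFrameGroupBy object, located at 0x000001A4086832B0 in this run. The address will differ each time the code is run.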

In [52]:

print(grouped_year_df)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A4086832B0>

4.¶

Next, look at the result of extracting the lifeExp column.

In [53]:

grouped_year_df_lifeExp = grouped_year_df['lifeExp']
print(type(grouped_year_df_lifeExp))

<class 'pandas.core.groupby.generic.SeriesGroupBy'>

5.¶

Finally, look at the result of calling the mean method, which computes the average.

In [54]:

mean_lifeExp_by_year = grouped_year_df_lifeExp.mean()
print(mean_lifeExp_by_year)

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
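The three steps (group, select a column, aggregate) are easier to verify on data small enough to compute by hand. The following sketch uses four made-up rows and compares the grouped mean for one year with a mean computed directly from a filtered selection.

```python
import pandas as pd

sample = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'lifeExp': [30.0, 40.0, 32.0, 44.0],
})

grouped = sample.groupby('year')     # DataFrameGroupBy
life_exp = grouped['lifeExp']        # SeriesGroupBy
mean_by_year = life_exp.mean()       # Series indexed by year

print(mean_by_year)

# The same number computed by hand for one group
manual_1952 = sample.loc[sample['year'] == 1952, 'lifeExp'].mean()
print(manual_1952)
```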

6.¶

The means of the lifeExp and gdpPercap columns can be computed in one step by grouping on both year and continent.

In [57]:

multi_group_var = df.groupby(['year','continent'])[['lifeExp','gdpPercap']].mean()
print(multi_group_var)

                  lifeExp     gdpPercap
year continent                         
1952 Africa     39.135500   1252.572466
     Americas   53.279840   4079.062552
     Asia       46.314394   5195.484004
     Europe     64.408500   5661.057435
     Oceania    69.255000  10298.085650
1957 Africa     41.266346   1385.236062
     Americas   55.960280   4616.043733
     Asia       49.318544   5787.732940
     Europe     66.703067   6963.012816
     Oceania    70.295000  11598.522455
1962 Africa     43.319442   1598.078825
     Americas   58.398760   4901.541870
     Asia       51.563223   5729.369625
     Europe     68.539233   8365.486814
     Oceania    71.085000  12696.452430
1967 Africa     45.334538   2050.363801
     Americas   60.410920   5668.253496
     Asia       54.663640   5971.173374
     Europe     69.737600  10143.823757
     Oceania    71.310000  14495.021790
1972 Africa     47.450942   2339.615674
     Americas   62.394920   6491.334139
     Asia       57.319269   8187.468699
     Europe     70.775033  12479.575246
     Oceania    71.910000  16417.333380
1977 Africa     49.580423   2585.938508
     Americas   64.391560   7352.007126
     Asia       59.610556   7791.314020
     Europe     71.937767  14283.979110
     Oceania    72.855000  17283.957605
1982 Africa     51.592865   2481.592960
     Americas   66.228840   7506.737088
     Asia       62.617939   7434.135157
     Europe     72.806400  15617.896551
     Oceania    74.290000  18554.709840
1987 Africa     53.344788   2282.668991
     Americas   68.090720   7793.400261
     Asia       64.851182   7608.226508
     Europe     73.642167  17214.310727
     Oceania    75.320000  20448.040160
1992 Africa     53.629577   2281.810333
     Americas   69.568360   8044.934406
     Asia       66.537212   8639.690248
     Europe     74.440100  17061.568084
     Oceania    76.945000  20894.045885
1997 Africa     53.598269   2378.759555
     Americas   71.150480   8889.300863
     Asia       68.020515   9834.093295
     Europe     75.505167  19076.781802
     Oceania    78.190000  24024.175170
2002 Africa     53.325231   2599.385159
     Americas   72.422040   9287.677107
     Asia       69.233879  10174.090397
     Europe     76.700600  21711.732422
     Oceania    79.740000  26938.778040
2007 Africa     54.806038   3089.032605
     Americas   73.608120  11003.031625
     Asia       70.728485  12473.026870
     Europe     77.648600  25054.481636
     Oceania    80.719500  29810.188275
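Grouping by two columns produces a result whose index has two levels, which is why year and continent appear stacked on the left of the output. The sketch below, using a small made-up DataFrame, shows one way to read a single value from such a result and how reset_index turns the index levels back into ordinary columns.

```python
import pandas as pd

sample = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'continent': ['Asia', 'Europe', 'Asia', 'Europe'],
    'lifeExp': [30.0, 60.0, 32.0, 62.0],
})

grouped_mean = sample.groupby(['year', 'continent'])[['lifeExp']].mean()
print(list(grouped_mean.index.names))    # two index levels

# A (year, continent) pair selects one row of the result
print(grouped_mean.loc[(1952, 'Asia'), 'lifeExp'])

# reset_index turns the index levels back into ordinary columns
flat = grouped_mean.reset_index()
print(flat.columns.tolist())
```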

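7. Counting grouped data¶

Now let's count how many items each group contains. In statistics this is called the frequency. The nunique method makes it easy: it counts the distinct values in each group. The following groups the DataFrame by continent, extracts only the country column, and counts the number of distinct countries in each continent.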

In [58]:

print(df.groupby('continent')['country'].nunique())

continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64
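Because each country appears once per survey year in Gapminder, counting distinct values and counting rows give different answers. The following sketch uses made-up data to contrast nunique with size, which counts every row in a group including repeats.

```python
import pandas as pd

sample = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Asia', 'Europe'],
    'country': ['A', 'A', 'B', 'C'],
})

by_continent = sample.groupby('continent')['country']

distinct_counts = by_continent.nunique()   # distinct countries per continent
row_counts = by_continent.size()           # rows per continent, repeats included

print(distinct_counts)
print(row_counts)
```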

2-4 Drawing Graphs¶

1. First, import the graphing library.¶

In [70]:

%matplotlib inline
import matplotlib.pyplot as plt

In [75]:

global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
print(global_yearly_life_expectancy)

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

In [76]:

global_yearly_life_expectancy.plot()

Out[76]:

<AxesSubplot:xlabel='year'>

Source: "Do it 데이터 분석을 위한 판다스 입문"
