Pandas (4) DataFrame Operations

 

Operations on the Entire DataFrame

 
In [31]:
import pandas as pd

pd.options.display.max_rows = 8
movie = pd.read_csv(r'C:\Users\user\jupyterpractice\EDA\Pandas-Cookbook-master\data\movie.csv')
movie.shape
 
Out[31]:
(4916, 28)
 
 
In [32]:
# size returns the total number of elements (rows x columns)
movie.size
 
Out[32]:
137648
 
 
In [33]:
# As in NumPy, ndim returns the number of dimensions
movie.ndim
 
Out[33]:
2
 
 
In [34]:
# len returns the number of rows
len(movie)
 
Out[34]:
4916
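
The four attributes above are related: size is the product of the two entries in shape (here 4916 x 28 = 137648), and len matches the first entry of shape. A minimal sketch of that relationship, using a small made-up DataFrame (the column names and values are hypothetical stand-ins for movie.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for movie.csv
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "duration": [120, 95, 101],
})

n_rows, n_cols = df.shape

print(df.shape)  # (rows, columns)
print(df.size)   # total number of cells
print(df.ndim)   # a DataFrame is always 2-dimensional
print(len(df))   # number of rows

# size is rows x columns, and len is the row count
assert df.size == n_rows * n_cols
assert len(df) == n_rows
```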
 
 
In [35]:
# The count method returns the number of non-missing values in each column.
movie.count()
 
Out[35]:
color                     4897
director_name             4814
num_critic_for_reviews    4867
duration                  4901
                          ... 
actor_2_facebook_likes    4903
imdb_score                4916
aspect_ratio              4590
movie_facebook_likes      4916
Length: 28, dtype: int64
 
 
In [36]:
# min() returns the minimum of each column, skipping missing values by default
movie.min()
 
Out[36]:
num_critic_for_reviews        1
duration                      7
director_facebook_likes       0
actor_3_facebook_likes        0
                           ... 
actor_2_facebook_likes        0
imdb_score                  1.6
aspect_ratio               1.18
movie_facebook_likes          0
Length: 19, dtype: object
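
The output above covers fewer columns than the 28 in the DataFrame, since string columns with missing values cannot be compared. Depending on the pandas version, such columns may raise an error instead of being dropped, so it can be safer to restrict the reduction to numeric columns explicitly. A minimal sketch with the numeric_only parameter, using hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: one string column and two numeric columns
df = pd.DataFrame({
    "director_name": ["X", None, "Z"],
    "duration": [120.0, 95.0, None],
    "imdb_score": [7.1, 6.4, 8.0],
})

# Only numeric columns take part in the reduction
numeric_min = df.min(numeric_only=True)
print(numeric_min)
```

The string column is left out of the result, and the missing duration is skipped because skipna defaults to True.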
 
 
In [37]:
# describe() returns the descriptive statistics above (count, min) together with mean, std, quartiles, and max
# The result is a DataFrame whose index holds the names of the statistics
movie.describe()
 
Out[37]:
  num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 4867.000000 4901.000000 4814.000000 4893.000000 4909.000000 4.054000e+03 4.916000e+03 4916.000000 4903.000000 4895.000000 4.432000e+03 4810.000000 4903.000000 4916.000000 4590.000000 4916.000000
mean 137.988905 107.090798 691.014541 631.276313 6494.488491 4.764451e+07 8.264492e+04 9579.815907 1.377320 267.668846 3.654749e+07 2002.447609 1621.923516 6.437429 2.222349 7348.294142
std 120.239379 25.286015 2832.954125 1625.874802 15106.986884 6.737255e+07 1.383222e+05 18164.316990 2.023826 372.934839 1.002427e+08 12.453977 4011.299523 1.127802 1.402940 19206.016458
min 1.000000 7.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1916.000000 0.000000 1.600000 1.180000 0.000000
25% 49.000000 93.000000 7.000000 132.000000 607.000000 5.019656e+06 8.361750e+03 1394.750000 0.000000 64.000000 6.000000e+06 1999.000000 277.000000 5.800000 1.850000 0.000000
50% 108.000000 103.000000 48.000000 366.000000 982.000000 2.504396e+07 3.313250e+04 3049.000000 1.000000 153.000000 1.985000e+07 2005.000000 593.000000 6.600000 2.350000 159.000000
75% 191.000000 118.000000 189.750000 633.000000 11000.000000 6.110841e+07 9.377275e+04 13616.750000 2.000000 320.500000 4.300000e+07 2011.000000 912.000000 7.200000 2.350000 2000.000000
max 813.000000 511.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 4.200000e+09 2016.000000 137000.000000 9.500000 16.000000 349000.000000
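
Because the result of describe() is itself a DataFrame indexed by statistic name, individual statistics can be pulled out with .loc, and the table can be transposed when there are many columns. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric columns of movie.csv
df = pd.DataFrame({
    "duration": [90, 100, 110, 120],
    "imdb_score": [6.0, 7.0, 8.0, 9.0],
})

summary = df.describe()

# Select one statistic for every column by its index label
means = summary.loc["mean"]
print(means)

# Transpose so that each original column becomes a row
by_column = summary.T
print(by_column)
```

With 16 numeric columns as in the movie data, the transposed form is often easier to read.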
 
 
In [38]:
pd.options.display.max_rows = 10
 
 
In [39]:
# The percentiles parameter lets you specify exactly which quantiles to report
movie.describe(percentiles=[.01, .3, .99])
 
Out[39]:
  num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 4867.000000 4901.000000 4814.000000 4893.000000 4909.000000 4.054000e+03 4.916000e+03 4916.000000 4903.000000 4895.000000 4.432000e+03 4810.000000 4903.000000 4916.000000 4590.000000 4916.000000
mean 137.988905 107.090798 691.014541 631.276313 6494.488491 4.764451e+07 8.264492e+04 9579.815907 1.377320 267.668846 3.654749e+07 2002.447609 1621.923516 6.437429 2.222349 7348.294142
std 120.239379 25.286015 2832.954125 1625.874802 15106.986884 6.737255e+07 1.383222e+05 18164.316990 2.023826 372.934839 1.002427e+08 12.453977 4011.299523 1.127802 1.402940 19206.016458
min 1.000000 7.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1916.000000 0.000000 1.600000 1.180000 0.000000
1% 2.000000 43.000000 0.000000 0.000000 6.080000 8.474800e+03 5.300000e+01 6.000000 0.000000 1.940000 6.000000e+04 1951.000000 0.000000 3.100000 1.330000 0.000000
30% 60.000000 95.000000 11.000000 176.000000 694.000000 7.914069e+06 1.186450e+04 1684.500000 0.000000 80.000000 8.000000e+06 2000.000000 345.000000 6.000000 1.850000 0.000000
50% 108.000000 103.000000 48.000000 366.000000 982.000000 2.504396e+07 3.313250e+04 3049.000000 1.000000 153.000000 1.985000e+07 2005.000000 593.000000 6.600000 2.350000 159.000000
99% 546.680000 189.000000 16000.000000 11000.000000 44920.000000 3.264128e+08 6.815846e+05 62413.900000 8.000000 1999.240000 2.000000e+08 2016.000000 17000.000000 8.500000 4.000000 93850.000000
max 813.000000 511.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 4.200000e+09 2016.000000 137000.000000 9.500000 16.000000 349000.000000
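
The percentile rows reported by describe() should agree with what the quantile() method returns for the same fractions. A minimal sketch checking this on a hypothetical single-column DataFrame (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one outlier
df = pd.DataFrame({"duration": [80, 90, 100, 110, 120, 130, 300]})

fractions = [.01, .3, .99]
summary = df.describe(percentiles=fractions)

# The same fractions computed directly on the column
direct = df["duration"].quantile(fractions)

# describe() labels the rows as percentages: "1%", "30%", "99%"
for label, value in zip(["1%", "30%", "99%"], direct):
    assert np.isclose(summary.loc[label, "duration"], value)

print(summary)
```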
 
 
In [40]:
pd.options.display.max_rows = 8
 
 
In [42]:
# Count the missing values in each column by chaining isnull().sum()
movie.isnull().sum()
 
Out[42]:
color                      19
director_name             102
num_critic_for_reviews     49
duration                   15
                         ... 
actor_2_facebook_likes     13
imdb_score                  0
aspect_ratio              326
movie_facebook_likes        0
Length: 28, dtype: int64
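
These counts complement the count() output shown earlier: for every column, the missing count plus the non-missing count equals the number of rows (for example, color has 19 + 4897 = 4916). A minimal sketch of that check on hypothetical toy data, plus the share of missing values per column obtained by taking the mean of the boolean mask:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
df = pd.DataFrame({
    "color": ["Color", None, "Color", "Black and White"],
    "duration": [120.0, np.nan, np.nan, 95.0],
    "imdb_score": [7.1, 6.4, 8.0, 5.5],
})

missing = df.isnull().sum()  # missing values per column
present = df.count()         # non-missing values per column

# Missing + non-missing always adds up to the row count
assert ((missing + present) == len(df)).all()

# The mean of a boolean mask is the fraction of True values
missing_ratio = df.isnull().mean()
print(missing)
print(missing_ratio)
```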
 

 

The skipna parameter: how to stop ignoring missing values

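  • By default, pandas ignores missing values in numeric columns when computing statistics, i.e. skipna=True is the default.
  • Setting skipna=False makes a statistic return NaN for any column that contains at least one missing value.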
 
In [29]:
movie.min(skipna=False)
 
Out[29]:
num_critic_for_reviews     NaN
duration                   NaN
director_facebook_likes    NaN
actor_3_facebook_likes     NaN
                          ... 
actor_2_facebook_likes     NaN
imdb_score                 1.6
aspect_ratio               NaN
movie_facebook_likes       0.0
Length: 16, dtype: float64
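
The effect of skipna is easiest to see side by side. A minimal sketch on hypothetical toy data, where only one of the two columns has a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: duration has one missing value, imdb_score has none
df = pd.DataFrame({
    "duration": [120.0, np.nan, 95.0],
    "imdb_score": [7.1, 6.4, 8.0],
})

default_min = df.min()             # skipna=True: NaN is ignored
strict_min = df.min(skipna=False)  # any NaN in a column yields NaN

print(default_min)
print(strict_min)
```

As in the output above, only the columns without any missing values (imdb_score here) keep a real minimum when skipna=False.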
 
