24 minute read

Saya ingat dulu waktu zaman kuliah, setidaknya membutuhkan waktu 30 menit untuk melakukan dan membuat report berisi analisa statistika deskriptif dari suatu data. Di zaman sekarang, untuk melakukan hal yang sama, saya hanya butuh waktu tak lebih dari 1 menit saja.

Berikut ini akan saya tunjukkan beberapa alternatif melakukan dan membuat analisa statistika deskriptif sederhana dengan R. Sebagai contoh, saya akan gunakan data mtcars dari bawaan R sebagai berikut:

df %>% knitr::kable()
  mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Oke, saya mulai ya!

Menggunakan base R

Kita bisa menggunakan function summary() untuk mendapatkan summary statistic berikut:

df %>% summary()
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

Kita bisa menggunakan function str() untuk mendapatkan informasi struktur data berikut:

df %>% str()
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Menggunakan library(dplyr)

Berikut adalah summary statistics menggunakan library(dplyr).

df %>% glimpse()
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

Menggunakan library(skimr)

Jika kalian hendak mendapatkan analisa yang lebih lengkap, kita bisa memanfaatkan library(skimr) berikut ini:

df %>% skim()
   
Name Piped data
Number of rows 32
Number of columns 11
_______________________  
Column type frequency:  
numeric 11
________________________  
Group variables None

Data summary

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 20.09 6.03 10.40 15.43 19.20 22.80 33.90 ▃▇▅▁▂
cyl 0 1 6.19 1.79 4.00 4.00 6.00 8.00 8.00 ▆▁▃▁▇
disp 0 1 230.72 123.94 71.10 120.83 196.30 326.00 472.00 ▇▃▃▃▂
hp 0 1 146.69 68.56 52.00 96.50 123.00 180.00 335.00 ▇▇▆▃▁
drat 0 1 3.60 0.53 2.76 3.08 3.70 3.92 4.93 ▇▃▇▅▁
wt 0 1 3.22 0.98 1.51 2.58 3.33 3.61 5.42 ▃▃▇▁▂
qsec 0 1 17.85 1.79 14.50 16.89 17.71 18.90 22.90 ▃▇▇▂▁
vs 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
am 0 1 0.41 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
gear 0 1 3.69 0.74 3.00 3.00 4.00 4.00 5.00 ▇▁▆▁▂
carb 0 1 2.81 1.62 1.00 2.00 2.00 4.00 8.00 ▇▂▅▁▁

Menggunakan library(psych)

Alternatif lain mendapatkan summary statistics dalam bentuk tabel adalah dengan memanfaatkan library(psych).

psych::describe(df)
     vars  n   mean     sd median trimmed    mad   min    max  range  skew
mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
am      9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36
gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
     kurtosis    se
mpg     -0.37  1.07
cyl     -1.76  0.32
disp    -1.21 21.91
hp      -0.14 12.12
drat    -0.71  0.09
wt      -0.02  0.17
qsec     0.34  0.32
vs      -2.00  0.09
am      -1.92  0.09
gear    -1.07  0.13
carb     1.26  0.29

Menggunakan library(Hmisc)

Alternatif lain mendapatkan summary statistics dalam bentuk narasi per variabel adalah dengan memanfaatkan library(Hmisc).

Hmisc::describe(df)
df 

 11  Variables      32  Observations
--------------------------------------------------------------------------------
mpg 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
      32        0       25    0.999    20.09     19.6    6.796    12.00 
     .10      .25      .50      .75      .90      .95 
   14.34    15.43    19.20    22.80    30.09    31.30 

lowest : 10.4 13.3 14.3 14.7 15  , highest: 26   27.3 30.4 32.4 33.9
--------------------------------------------------------------------------------
cyl 
       n  missing distinct     Info     Mean  pMedian      Gmd 
      32        0        3    0.866    6.188        6    1.948 
                            
Value          4     6     8
Frequency     11     7    14
Proportion 0.344 0.219 0.438

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
disp 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
      32        0       27    0.999    230.7    223.4    142.5    77.35 
     .10      .25      .50      .75      .90      .95 
   80.61   120.83   196.30   326.00   396.00   449.00 

lowest : 71.1 75.7 78.7 79   95.1, highest: 360  400  440  460  472 
--------------------------------------------------------------------------------
hp 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
      32        0       22    0.997    146.7    142.5    77.04    63.65 
     .10      .25      .50      .75      .90      .95 
   66.00    96.50   123.00   180.00   243.50   253.55 

lowest :  52  62  65  66  91, highest: 215 230 245 264 335
--------------------------------------------------------------------------------
drat 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
      32        0       22    0.997    3.597    3.575   0.6099    2.853 
     .10      .25      .50      .75      .90      .95 
   3.007    3.080    3.695    3.920    4.209    4.314 

lowest : 2.76 2.93 3    3.07 3.08, highest: 4.08 4.11 4.22 4.43 4.93
--------------------------------------------------------------------------------
wt 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
      32        0       29    0.999    3.217    3.186    1.089    1.736 
     .10      .25      .50      .75      .90      .95 
   1.956    2.581    3.325    3.610    4.048    5.293 

lowest : 1.513 1.615 1.835 1.935 2.14 , highest: 3.845 4.07  5.25  5.345 5.424
--------------------------------------------------------------------------------
qsec 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
      32        0       30        1    17.85     17.8    2.009    15.05 
     .10      .25      .50      .75      .90      .95 
   15.53    16.89    17.71    18.90    19.99    20.10 

lowest : 14.5  14.6  15.41 15.5  15.84, highest: 19.9  20    20.01 20.22 22.9 
--------------------------------------------------------------------------------
vs 
       n  missing distinct     Info      Sum     Mean 
      32        0        2    0.739       14   0.4375 

--------------------------------------------------------------------------------
am 
       n  missing distinct     Info      Sum     Mean 
      32        0        2    0.724       13   0.4062 

--------------------------------------------------------------------------------
gear 
       n  missing distinct     Info     Mean  pMedian      Gmd 
      32        0        3    0.841    3.688      3.5   0.7863 
                            
Value          3     4     5
Frequency     15    12     5
Proportion 0.469 0.375 0.156

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
carb 
       n  missing distinct     Info     Mean  pMedian      Gmd 
      32        0        6    0.929    2.812      2.5    1.718 
                                              
Value          1     2     3     4     6     8
Frequency      7    10     3    10     1     1
Proportion 0.219 0.312 0.094 0.312 0.031 0.031

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------

Menggunakan library(GGally)

Jika kita membutuhkan summary statistics berupa density plot dan korelasi antar variabel, kita bisa menggunakan library(GGally).

GGally::ggpairs(df)

Menggunakan library(DataExplorer)

Ada satu library yang bisa digunakan untuk mendapatkan satu file report berformat .html yakni bernama library(DataExplorer).

Hasilnya meng-cover analisa sebagai berikut:

  • Basic Statistics
    • Raw Counts
    • Percentages
  • Data Structure
  • Missing Data Profile
  • Univariate Distribution
    • Histogram
    • QQ Plot
  • Correlation Analysis
  • Principal Component Analysis

Mohon maaf saya tak bisa menampilkannya dalam blog ini tapi kalian bisa mencobanya sendiri dengan skrip:

DataExplorer::create_report(df)

if you find this article helpful, support this blog by clicking the ads.