How to learn about an unknown data set quickly? - R and Python

When you come across an unknown data set, it is important to get to know it before jumping into analysis. For instance, knowing the available fields, their data types, the counts of missing, unique, and complete values, their distributions, and the presence or absence of outliers helps you assess whether the data set suits the intended analysis and where it needs cleaning.

R and Python have these functionalities readily available at various levels of detail.

R is my main EDA tool as of now, and I am a big fan of the tidyverse. When I first come across a data set in R, I usually use the skim() function from the skimr package to get to know it. Trying to find a similar function in pandas was a frustrating experience until I came across Google Facets. In this post, I will first introduce skim() and then show how to use Google Facets to get a similar outcome in Python.

Why skim() in R?

skim() is great for learning about the variables in a data set: their data types, missing values, unique values, and some statistics on the distribution of each variable, chosen according to its type. Let me show you with an example.

I use the Baby Names from Social Security Card Applications - National Data data set downloaded from catalog.data.gov/dataset/baby-names-from-so...

Figure 1. Summary of the data set from skim() in R

As shown in Figure 1, skim() lists the dimensions of the data set first, then groups the variables by data type and shows the counts of missing, complete, total, and unique values. Depending on the data type, it also shows some statistics on the distribution of the data.

Thus, with just one function call I learned the following about the data set.

  1. The data set consists of name, sex, and the number of occurrences by year. All fields are complete, so there are no issues with missing values.
  2. Data is available for 139 years, from 1880 to 2018.
  3. Out of the 200K baby names over 139 years in the USA, there are about 98.4K unique names.
  4. A name has been re-used about 176 times on average over 139 years. However, the distribution of name counts is right-skewed with a long tail, ranging from 5 to about 95K. That means a few names are much more popular than the rest.
  5. The shortest name has 2 characters, while the longest has 11.
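
The five observations above can also be checked by hand. The sketch below does this in pandas on a tiny made-up table. The column names (name, sex, count, year) and all values are assumptions for illustration only; they are not taken from the real data set.

```python
import pandas as pd

# Hypothetical stand-in for the baby names table (values are made up).
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "Al"],
    "sex": ["F", "F", "M", "F", "M"],
    "count": [7065, 2604, 9655, 6919, 5],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# 1. Missing values per field.
missing_per_column = toy.isna().sum()

# 2. First and last year covered.
year_range = (toy["year"].min(), toy["year"].max())

# 3. Number of distinct names.
n_unique_names = toy["name"].nunique()

# 4. Average of the occurrence column across all rows.
mean_count = toy["count"].mean()

# 5. Number of characters in each name.
name_lengths = toy["name"].str.len()

print(missing_per_column.to_dict())
print(year_range, n_unique_names, mean_count)
print(name_lengths.min(), name_lengths.max())
```

Each line answers one question, which is exactly the effort a single skim() call saves.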

While summaries can also be generated with the summary() or str() functions, the information they provide is too limited for a thorough understanding of the data set (Figures 2 and 3).

Figure 3. Console output of str() in R

Exploring similar avenues in Python

In Python, pandas DataFrames have info() and describe() methods that give more or less the same details as str() and summary() in R (Figures 4 and 5).
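
As a minimal sketch of those two calls, the example below runs them on a small made-up table. The column names and values are assumptions for illustration and are not the real data set.

```python
import io

import pandas as pd

# Hypothetical toy table with one missing name (values are made up).
toy = pd.DataFrame({
    "name": ["Mary", None, "John", "Mary"],
    "sex": ["F", "F", "M", "F"],
    "count": [7065, 2604, 9655, 6919],
    "year": [1880, 1880, 1880, 1881],
})

# info() prints column names, non-null counts and dtypes.
# Writing to a buffer lets us keep the text as a string.
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# describe() covers only numeric columns by default.
numeric_summary = toy.describe()
print(numeric_summary)

# include="all" adds unique/top/freq rows for the text columns.
all_summary = toy.describe(include="all")
print(all_summary)
```

Even with include="all", the result is one wide table of numbers with no grouping by data type and no view of the distributions.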

Being spoiled by skim() in R, I looked for an alternative in Python and came across Google Facets: pair-code.github.io/facets. It is an open-source tool that you can use either by uploading your data file to generate the summary or by embedding it in a Jupyter notebook in Python. Summaries are generated as an 'Overview', similar to skim(), or in more depth as a 'Dive'.

Here's how to generate an overview of the data using Google Facets and Jupyter. Make sure the facets-overview package is installed in your Python environment. The code snippet below is from github.com/PAIR-code/facets/tree/master/fac... Make sure that your own data set is passed into ProtoFromDataFrames().

#@title Install the facets_overview pip package.
#!pip install facets-overview

# Create the feature stats for the datasets and stringify it.
import base64
from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator

gfsg = GenericFeatureStatisticsGenerator()
# `dat` is the pandas DataFrame holding the data set; it must be loaded beforehand.
proto = gfsg.ProtoFromDataFrames([{'name': 'babynames', 'table': dat}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")

# Display the facets overview visualization for this data
from IPython.display import display, HTML

HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

Figure 4. Summary of the data set generated by Google Facets

As shown in Figure 4, Google Facets provides information similar to skim(): the variable types, their total counts, the counts of missing and unique values, and some statistics. In addition, I found the information on the top category very useful. For instance, we now know that the most popular name is William, even though female records outnumber male ones in the data set.
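
A top-category figure of this kind can be cross-checked in pandas with value_counts(). The sketch below uses a made-up table, and top_category() is a hypothetical helper written for this example; it is not part of Facets or pandas.

```python
import pandas as pd

# Hypothetical toy rows (values are made up for illustration).
toy = pd.DataFrame({
    "name": ["William", "Mary", "William", "Anna", "Mary", "William"],
    "sex": ["M", "F", "M", "F", "F", "F"],
})


def top_category(column: pd.Series):
    """Return the most frequent value in a column and its row count."""
    counts = column.value_counts()  # sorted from most to least frequent
    return counts.index[0], int(counts.iloc[0])


top_name = top_category(toy["name"])

# Share of rows per sex, as fractions that sum to 1.
sex_share = toy["sex"].value_counts(normalize=True)

print(top_name)
print(sex_share.to_dict())
```

Note that this sketch counts rows rather than babies. Weighting by the occurrence column would answer a different question.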

For more information on Google Facets: pair-code.github.io/facets