
Pandas - DataFrame

By Marcelo Fernandes Dec 3, 2017

Pandas DataFrame - Basics

The DataFrame class is one of the most useful in pandas. The documentation describes it as a "Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)." It can be thought of as a dict-like container for pandas Series objects.
Its constructor signature is:


class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)


Where:

  • data: numpy ndarray (structured or homogeneous), dict, or DataFrame. A dict can contain Series, arrays, constants, or list-like objects.
  • index: Index or array-like. Index to use for the resulting frame. Defaults to np.arange(n) if the input data carries no indexing information and no index is provided.
  • columns: Index or array-like. Column labels to use for the resulting frame. Defaults to np.arange(n) if no column labels are provided.
  • dtype: Data type to force; otherwise it is inferred.
  • copy: Whether to copy data from the inputs. Only affects DataFrame / 2d ndarray input.
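
As a rough illustration of how these parameters interact, the sketch below builds a frame from a small ndarray. The labels row_a, row_b, x, y and z are made up for this example and are not part of pandas.

```python
import numpy as np
import pandas as pd

# A small homogeneous 2x3 array; all labels below are hypothetical.
values = np.array([[1, 2, 3], [4, 5, 6]])

# index labels the rows, columns labels the columns, dtype forces the type.
frame = pd.DataFrame(data=values, index=["row_a", "row_b"],
                     columns=["x", "y", "z"], dtype=float)
print(frame)

# With no index or columns given, both axes get integer labels 0..n-1.
default_frame = pd.DataFrame(values)
print(list(default_frame.index), list(default_frame.columns))
```

Passing dtype=float converts the integer input, so every column of frame holds floating-point values.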



Instantiation

You can use any of the following ways to instantiate a DataFrame:



>>> from pandas import DataFrame

# Using a dictionary
>>> dict_data = {'name': ['Marcelo', 'John', 'Mary'],
                 'height': [1.80, 1.84, 1.65],
                 'favorite_food': ['sushi', 'barbecue', 'icecream']}

>>> df = DataFrame(dict_data)
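# Note: the columns print in alphabetical order because older pandas versions sorted dict keys;
# pandas 0.23+ on Python 3.6+ keeps the insertion order instead.
>>> df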
  favorite_food  height     name
0         sushi    1.80  Marcelo
1      barbecue    1.84     John
2      icecream    1.65     Mary

# Using arrays
>>> data = [['Marcelo', 'John', 'Mary'],
            [1.80, 1.84, 1.65],
            ['sushi', 'barbecue', 'icecream']]

>>> index = ['name', 'height', 'favorite_food']

>>> df2 = DataFrame(data=data, index=index)
# Each inner list becomes a row, so the representation looks different.
>>> df2
                     0         1         2
name           Marcelo      John      Mary
height             1.8      1.84      1.65
favorite_food    sushi  barbecue  icecream

# We can transpose our data.
>>> df2.T
      name height favorite_food
0  Marcelo    1.8         sushi
1     John   1.84      barbecue
2     Mary   1.65      icecream

# Another example, using a dictionary again
>>> data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
            'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                     'Lions', 'Lions'],
            'wins': [11, 8, 10, 15, 11, 6, 10, 4],
            'losses': [5, 8, 6, 1, 5, 10, 6, 12]}

>>> football = DataFrame(data)


I recommend the dictionary format, as it is more straightforward and easier to understand at first glance.
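
One practical reason to prefer column-oriented input, sketched here under the assumption of a recent pandas version: a frame built from rows of mixed types and then transposed keeps every column as dtype object, while column-wise inference gives height a numeric dtype. If the data does arrive one record at a time, passing columns= avoids the transpose. The variable names are only illustrative.

```python
import pandas as pd

# Row-oriented records: one inner list per person.
records = [["Marcelo", 1.80, "sushi"],
           ["John", 1.84, "barbecue"],
           ["Mary", 1.65, "icecream"]]
people = pd.DataFrame(records, columns=["name", "height", "favorite_food"])

# The transposed layout from the example above, rebuilt for comparison.
by_field = pd.DataFrame([["Marcelo", "John", "Mary"],
                         [1.80, 1.84, 1.65],
                         ["sushi", "barbecue", "icecream"]],
                        index=["name", "height", "favorite_food"]).T

print(people["height"].dtype)    # numeric dtype inferred per column
print(by_field["height"].dtype)  # stays object after the transpose
```

A numeric dtype matters as soon as you call methods such as describe(), which skip object columns by default.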



Accessing the Data

Sometimes we only want to work with a few columns of our data and draw some insights from them. Pandas offers a convenient way to do that:



# Retrieving a column
>>> df['height']
0    1.80
1    1.84
2    1.65
Name: height, dtype: float64

>>> df[['height', 'name']]
   height     name
0    1.80  Marcelo
1    1.84     John
2    1.65     Mary


>>> df['height'] > 1.80
0    False
1     True
2    False
Name: height, dtype: bool

>>> df[df['height'] > 1.80]
  favorite_food  height  name
1      barbecue    1.84  John

# Selecting values of one column based on a condition on another column
>>> df['name'][df['height'] > 1.80]
1    John
Name: name, dtype: object

# Combining conditions with AND (&) and OR (|); each condition needs its own parentheses.
>>> football[(football.wins > 8) & (football.losses < 5)]
   losses     team  wins  year
3       1  Packers    15  2011

>>> football[(football.wins > 8) | (football.losses < 7)]
   losses     team  wins  year
0       5    Bears    11  2010
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012
6       6    Lions    10  2011


# Accessing the raw values of a dataframe.
>>> df.values
array([['sushi', 1.8, 'Marcelo'],
       ['barbecue', 1.84, 'John'],
       ['icecream', 1.65, 'Mary']], dtype=object)

# Slicing
>>> df[:1]
  favorite_food  height     name
0         sushi     1.8  Marcelo
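
The same selections can also be written with .loc, which takes the row condition and the column labels in a single call and avoids the chained df['name'][...] lookup. This is a sketch that rebuilds the example frames so it runs on its own; it is one alternative way to express the filters above, not the only one.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Marcelo", "John", "Mary"],
                   "height": [1.80, 1.84, 1.65],
                   "favorite_food": ["sushi", "barbecue", "icecream"]})
football = pd.DataFrame({
    "year": [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    "team": ["Bears", "Bears", "Bears", "Packers", "Packers",
             "Lions", "Lions", "Lions"],
    "wins": [11, 8, 10, 15, 11, 6, 10, 4],
    "losses": [5, 8, 6, 1, 5, 10, 6, 12]})

# Row condition first, column label second.
tall_names = df.loc[df["height"] > 1.80, "name"]
print(tall_names)

# Each comparison is wrapped in parentheses because & and | bind
# more tightly than > and < in Python.
strong = football.loc[(football["wins"] > 8) & (football["losses"] < 5),
                      ["team", "year"]]
print(strong)
```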




Describing the Data

Some summary information about our data is useful so often that pandas provides the describe method, which shows a quick statistical summary of the numeric columns:



>>> football.describe()
          losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000


Between the min and max there are some percentage labels. Those are the percentiles. In this example they tell us that roughly 25% of the loss counts are at or below 5, 50% of the win counts are at or below 10, and 75% of the loss counts are at or below 8.5.
In general, a percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group fall.
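
To make the percentile rows concrete, the sketch below recomputes them for the losses column with Series.quantile. It assumes pandas' default linear interpolation, which is why the 75% value of 8.5 does not appear in the data itself.

```python
import pandas as pd

losses = pd.Series([5, 8, 6, 1, 5, 10, 6, 12], name="losses")

# The same three percentiles that describe() reports.
quartiles = losses.quantile([0.25, 0.5, 0.75])
print(quartiles)

# Share of seasons whose loss count is at or below the 75% value.
share_below = (losses <= quartiles.loc[0.75]).mean()
print(share_below)
```

With only eight observations the match between a percentile and the actual share of values below it is approximate, so treat these numbers as a summary and not as exact cut-offs.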



Data Representation

Often you will work with large datasets that are slow to load and display in full, so it is handy to look at only a few rows of the data. We can use head() and tail() to retrieve the first or the last rows of our DataFrame, respectively. Both return 5 rows by default and accept the number of rows as an argument.



>>> football.head()
   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012

>>> football.tail(2)
   losses   team  wins  year
6       6  Lions    10  2011
7      12  Lions     4  2012
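
A small sketch of the same idea on a larger frame; the column name n and the size of 1000 rows are arbitrary choices for illustration.

```python
import pandas as pd

# A hypothetical frame with 1000 rows.
big = pd.DataFrame({"n": range(1000)})

first_rows = big.head()    # no argument: first 5 rows
last_rows = big.tail(3)    # last 3 rows

print(first_rows)
print(last_rows)
print(big.shape)           # (rows, columns) without printing the data
```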


We can also retrieve the raw values or the column labels of our data:



>>> football.values
array([[5, 'Bears', 11, 2010],
       [8, 'Bears', 8, 2011],
       [6, 'Bears', 10, 2012],
       [1, 'Packers', 15, 2011],
       [5, 'Packers', 11, 2012],
       [10, 'Lions', 6, 2010],
       [6, 'Lions', 10, 2011],
       [12, 'Lions', 4, 2012]], dtype=object)

>>> football.columns
Index(['losses', 'team', 'wins', 'year'], dtype='object')
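
Newer pandas releases (0.24 and later) also offer DataFrame.to_numpy(), which the pandas documentation recommends over .values. The sketch below suggests why the array above has dtype=object: a frame with mixed column types has no common numeric type, while selecting only the numeric columns yields an integer array.

```python
import pandas as pd

football = pd.DataFrame({
    "year": [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    "team": ["Bears", "Bears", "Bears", "Packers", "Packers",
             "Lions", "Lions", "Lions"],
    "wins": [11, 8, 10, 15, 11, 6, 10, 4],
    "losses": [5, 8, 6, 1, 5, 10, 6, 12]})

# Strings and integers together force a generic object array.
mixed = football.to_numpy()
print(mixed.dtype)

# Numeric columns only: the array keeps an integer dtype.
numeric = football[["wins", "losses"]].to_numpy()
print(numeric.dtype, numeric.shape)
```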




Applying Functions to Our Data



# Applying a function to each value of a column
>>> football['losses'].apply(lambda x: x * 2 - x * 3)
0    -5
1    -8
2    -6
3    -1
4    -5
5   -10
6    -6
7   -12
Name: losses, dtype: int64

# Applying a function to each of the selected columns
>>> import numpy as np
>>> football[['wins', 'losses']].apply(np.mean)
wins      9.375
losses    6.625
dtype: float64
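
apply can also work row by row when called with axis=1. The sketch below computes a win ratio per season that way and compares it with the equivalent column arithmetic, which is usually the faster choice for simple formulas. The names stats and win_ratio are introduced here for illustration only.

```python
import pandas as pd

football = pd.DataFrame({
    "year": [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    "team": ["Bears", "Bears", "Bears", "Packers", "Packers",
             "Lions", "Lions", "Lions"],
    "wins": [11, 8, 10, 15, 11, 6, 10, 4],
    "losses": [5, 8, 6, 1, 5, 10, 6, 12]})
stats = football[["wins", "losses"]]

# axis=1 passes each row to the function as a Series.
win_ratio = stats.apply(
    lambda row: row["wins"] / (row["wins"] + row["losses"]), axis=1)

# The same result using whole-column arithmetic.
vectorized = stats["wins"] / (stats["wins"] + stats["losses"])
print(win_ratio)
print((win_ratio - vectorized).abs().max())

# Built-in reductions cover common cases such as the column means.
print(stats.mean())
```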