In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set_context("talk")


# Numpy Array¶

In [2]:
import numpy as np


As you learned in homework one the np.array is the key data structure in numpy for dense arrays of data.

# Array Creation¶

• From a list
In [3]:
np.array([[1.,2.], [3.,4.]])

Out[3]:
array([[ 1.,  2.],
[ 3.,  4.]])
In [4]:
np.array([["A", "matrix"], ["of", "words."]])

Out[4]:
array([['A', 'matrix'],
['of', 'words.']],
dtype='<U6')
• All zeros (what shape?)
In [5]:
np.zeros(5)

Out[5]:
array([ 0.,  0.,  0.,  0.,  0.])
• All ones
In [6]:
np.ones([3,2])

Out[6]:
array([[ 1.,  1.],
[ 1.,  1.],
[ 1.,  1.]])
In [7]:
np.eye(4)

Out[7]:
array([[ 1.,  0.,  0.,  0.],
[ 0.,  1.,  0.,  0.],
[ 0.,  0.,  1.,  0.],
[ 0.,  0.,  0.,  1.]])
• From a range:
In [8]:
np.arange(0, 10, 2)

Out[8]:
array([0, 2, 4, 6, 8])
In [9]:
np.linspace(0, 5, 10)

Out[9]:
array([ 0.        ,  0.55555556,  1.11111111,  1.66666667,  2.22222222,
2.77777778,  3.33333333,  3.88888889,  4.44444444,  5.        ])
In [10]:
np.arange(np.datetime64('2016-12-31'), np.datetime64('2017-02-01'))

Out[10]:
array(['2016-12-31', '2017-01-01', '2017-01-02', '2017-01-03',
'2017-01-04', '2017-01-05', '2017-01-06', '2017-01-07',
'2017-01-08', '2017-01-09', '2017-01-10', '2017-01-11',
'2017-01-12', '2017-01-13', '2017-01-14', '2017-01-15',
'2017-01-16', '2017-01-17', '2017-01-18', '2017-01-19',
'2017-01-20', '2017-01-21', '2017-01-22', '2017-01-23',
'2017-01-24', '2017-01-25', '2017-01-26', '2017-01-27',
'2017-01-28', '2017-01-29', '2017-01-30', '2017-01-31'], dtype='datetime64[D]')

• A random array (more on randomness later)
In [11]:
np.random.randn(3,2)

Out[11]:
array([[-0.46986367, -1.8230748 ],
[ 1.20935776,  0.54400893],
[ 0.2295634 ,  0.08987276]])
In [12]:
np.random.permutation(10)

Out[12]:
array([4, 2, 3, 0, 1, 9, 5, 6, 8, 7])

# Properties of Arrays¶

### Shape¶

In [13]:
A = np.array([[1.,2., 3.], [4.,5., 6.]])
print(A)
A.shape

[[ 1.  2.  3.]
[ 4.  5.  6.]]

Out[13]:
(2, 3)

### Type¶

In [14]:
A.dtype

Out[14]:
dtype('float64')
In [15]:
np.arange(1,5).dtype

Out[15]:
dtype('int64')
In [16]:
(np.arange(1,5) > 3).dtype

Out[16]:
dtype('bool')
In [17]:
np.array(["Hello", "World!"]).dtype

Out[17]:
dtype('<U6')

#### and we can change the type of an array:¶

In [18]:
np.array([1,2,3]).astype(float)

Out[18]:
array([ 1.,  2.,  3.])
In [19]:
np.array(["1","2","3"]).astype(int)

Out[19]:
array([1, 2, 3])

# Jagged Arrays¶

Is the following valid?

In [20]:
A = np.array([[1, 2, 3], [4, 5], [6]])
A

Out[20]:
array([[1, 2, 3], [4, 5], [6]], dtype=object)

What happened?

In [21]:
A.shape

Out[21]:
(3,)
In [22]:
print(A.dtype)

object

In [23]:
print(A[0])
print(A[1])
print(A[2])

[1, 2, 3]
[4, 5]
[6]


### Jagged arrays can be problematic:¶

1. Difficult to index
A[0,1]
> Error
A[0][1]
> 2

2. Difficult to extract columns.
Some solutions  pd.Series(A).apply(lambda x: pd.Series(x)) np.vectorize(lambda x: x[0])(A) 

# Reshaping¶

Often you will need to reshape matrices. Suppose you have the following array:

In [24]:
np.arange(1,13)

Out[24]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

What will the following produce:

np.arange(1,13).reshape(4,3)


Option A:

array([[ 1,  2,  3],
[ 4,  5,  6],
[ 7,  8,  9],
[10, 11, 12]])


Option B:

array([[ 1,  5,  9],
[ 2,  6, 10],
[ 3,  7, 11],
[ 4,  8, 12]])


Solution

In [25]:
A = np.arange(1,13).reshape(4,3)
A

Out[25]:
array([[ 1,  2,  3],
[ 4,  5,  6],
[ 7,  8,  9],
[10, 11, 12]])
• You can also flatten a matrix
In [26]:
A.flatten()

Out[26]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

Numpy stores data contiguously in memory

In [27]:
A.data.tobytes()

Out[27]:
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00'

Numpy stores matrices in row-major order (by rows)

In [28]:
print(np.arange(1,13).reshape(4,3, order='C'))
print()
print(np.arange(1,13).reshape(4,3, order='F'))

[[ 1  2  3]
[ 4  5  6]
[ 7  8  9]
[10 11 12]]

[[ 1  5  9]
[ 2  6 10]
[ 3  7 11]
[ 4  8 12]]

What does the 'F' mean? **F**ortran ordering. In BLAS libraries are specified for Fortran and C programming languages which differ both in the column (Fortran) or row (C) indexing.

# Slicing¶

From homework 1 you should already be pretty good at Slicing so let's test your slicing knowledge.

• Program 1:
x[:, 0]


B

• Program 2:
x[0, :]


A

• Program 3:
x[:2, 1:]


H

• Program 4:
x[0::2, :]


D

The last program was a bit tricky:

begin:end:stride


### Modifying a Slice¶

Suppose I wanted to make all entries in my matrix 0 in the top right corner as in (H) above.

In [29]:
A = np.arange(1,13).reshape(4,3)
print("Before:\n", A)

Before:
[[ 1  2  3]
[ 4  5  6]
[ 7  8  9]
[10 11 12]]

In [30]:
A[:2, 1:] = 0
print("After:\n", A)

After:
[[ 1  0  0]
[ 4  0  0]
[ 7  8  9]
[10 11 12]]


# Boolean Indexing¶

We can apply boolean operations to arrays. This is essential when trying to select and modify individual elements.

Question: Given the following definition of A:

[[   1.    2.    3.]
[   4.    5. -999.]
[   7.    8.    9.]
[  10. -999. -999.]]


what will the following output:

A > 3

• Option A:
False

• Option B:
array([[False, False, False],
[ True,  True, False],
[ True,  True,  True],
[ True, False, False]], dtype=bool)

In [31]:
A = np.array([[  1.,   2.,   3.],
[  4.,   5.,   -999.0],
[  7.,   8.,   9.],
[ 10.,  -999.0,  -999.0]])

A > 3.

Out[31]:
array([[False, False, False],
[ True,  True, False],
[ True,  True,  True],
[ True, False, False]], dtype=bool)

Question: What will the following output

A = np.array([[   1.,    2.,    3.],
[   4.,    5., -999.],
[   7.,    8.,    9.],
[  10., -999., -999.]])

A[A > 3]

• Option A:
array([ 4,  7, 10,  5,  8, 11,  6,  9, 12])

• Option B:
array([  4.,   5.,   7.,   8.,   9.,  10.])

• Option C:
array([[  nan,   nan,  nan],
[  4.,    5.,   nan],
[  7.,    8.,   9.],
[ 10.,    nan,  nan]])

In [32]:
A = np.array([[  1.,   2.,   3.],
[  4.,   5.,   -999.0],
[  7.,   8.,   9.],
[ 10.,  -999.0,  -999.0]])

A[A > 3]

Out[32]:
array([  4.,   5.,   7.,   8.,   9.,  10.])

Question: Why is the answer not two dimensional?

The -999.0 numbers seem like a place holder. Replace them with np.nan.

array([[   1.,    2.,    3.],
[   4.,    5., -999.],
[   7.,    8.,    9.],
[  10., -999., -999.]])


Solution

In [33]:
A = np.array([[  1.,   2.,   3.],
[  4.,   5.,   -999.0],
[  7.,   8.,   9.],
[ 10.,  -999.0,  -999.0]])


Construct a boolean array that indicates where the value is 999.0:

In [34]:
ind = (A == -999.0)
print(ind)

[[False False False]
[False False  True]
[False False False]
[False  True  True]]


Assign 0.0 to all the True entires:

In [35]:
A[ind] = np.nan
A

Out[35]:
array([[  1.,   2.,   3.],
[  4.,   5.,  nan],
[  7.,   8.,   9.],
[ 10.,  nan,  nan]])

Why might I want to replace the -999.0 with a np.nan?

## Solution¶

Could be safer in calculations

In [36]:
np.mean(A)

Out[36]:
nan

In [37]:
np.nanmean(A)

Out[37]:
5.4444444444444446

# More Complex Bit Logic¶

In [38]:
names = np.array(["Joey", "Henry", "Joseph",
"Jim", "Sam", "Deb", "Mike",
"Bin", "Joe", "Andrew", "Bob"])

favorite_number = np.arange(len(names))

staff = ["Joey", "Bin", "Deb", "Joe", "Sam", "Henry", "Andrew", "Joseph"]


### What is sum of the staff's favorite numbers:¶

Solution

Determine which people are staff using the np.in1d function

In [39]:
is_staff = np.in1d(names, staff)
is_staff

Out[39]:
array([ True,  True,  True, False,  True,  True, False,  True,  True,
True, False], dtype=bool)

Boolean indexing

In [40]:
favorite_number[is_staff].sum()

Out[40]:
36

### What does the following expression compute:¶

starts_with_j = np.char.startswith(names, "J")
starts_with_j[is_staff].mean()


**Solution**

The fraction of the staff have names that begin with J?

In [41]:
starts_with_j = np.char.startswith(names, "J")
starts_with_j[is_staff].mean()

Out[41]:
0.375
What does it mean to take the mean of an array of booleans?

The values True and False correspond to the integers 1 and 0 and are treated as such in mathematical expressions (e.g., mean(), sum(), as well as linear algebraic operations).

### What does the following expression compute:¶

favorite_number[starts_with_j & is_staff].sum()


**Solution**

What is the sum of the favorite numbers of staff starting with J

In [42]:
favorite_number[starts_with_j & is_staff].sum()

Out[42]:
10

# A Note on using Array operations¶

In [43]:
data = np.random.rand(1000000)


Consider the following two programs.

1. What do they do?
2. Which one is faster?
In [44]:
%%timeit
s = 0
for x in data:
if x > 0.5:
s += x
result = s/len(data)

10 loops, best of 3: 123 ms per loop

In [45]:
%%timeit
result = data[data > 0.5].mean()

100 loops, best of 3: 7.22 ms per loop


## Important Points¶

Using the array abstractions instead of looping can often be:

1. Clearer
2. Faster

These are fundamental goals of abstraction.

# Mathematical operations:¶

Numpy arrays support standard mathematical operations

In [46]:
A = np.arange(1., 13.).reshape(4,3)
print(A)

[[  1.   2.   3.]
[  4.   5.   6.]
[  7.   8.   9.]
[ 10.  11.  12.]]

In [47]:
A * 0.5 + 3

Out[47]:
array([[ 3.5,  4. ,  4.5],
[ 5. ,  5.5,  6. ],
[ 6.5,  7. ,  7.5],
[ 8. ,  8.5,  9. ]])

notice that operations are element wise.

In [48]:
A.T

Out[48]:
array([[  1.,   4.,   7.,  10.],
[  2.,   5.,   8.,  11.],
[  3.,   6.,   9.,  12.]])
In [49]:
A.sum()

Out[49]:
78.0

## Be Careful with Floating Point Numbers¶

What is the value of the following: $$A - \exp \left( \log \left( A \right) \right)$$

Solution:

In [50]:
A = np.arange(1., 13.).reshape(4,3)
print(A)

(A - np.exp(np.log(A)))

[[  1.   2.   3.]
[  4.   5.   6.]
[  7.   8.   9.]
[ 10.  11.  12.]]

Out[50]:
array([[  0.00000000e+00,   0.00000000e+00,  -4.44089210e-16],
[  0.00000000e+00,   8.88178420e-16,   0.00000000e+00],
[  8.88178420e-16,   1.77635684e-15,  -1.77635684e-15],
[ -1.77635684e-15,  -1.77635684e-15,   0.00000000e+00]])
**What happened?!** Floating point precision is not perfect and we are applying transcendental functions.

### A simpler examples¶

What is the value of the following expression:

In [51]:
0.1 + 0.2 == 0.3

Out[51]:
False
In [52]:
print(0.1 + 0.2)

0.30000000000000004


## Aggregating along an axis¶

### Grouping by row:¶

In [53]:
A.sum(axis=0)

Out[53]:
array([ 22.,  26.,  30.])

This is the same as:

[r,c] = A.shape

s = np.zeros(c)
for i in range(r):
s += A[i,:]

print(s)


### Grouping by col:¶

In [54]:
A.sum(axis=1)

Out[54]:
array([  6.,  15.,  24.,  33.])

This is the same as:

[r,c] = A.shape

s = np.zeros(r)
for i in range(c):
s += A[:,i]

print(s)


and many more

# Linear Algebra¶

Suppose we wanted to use linear algebra to compute the sum along axis 1

$$\texttt{A.sum(axis=1)} = \sum_j A_{i,j} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6\end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \\ 1\end{bmatrix}$$
In [55]:
b = np.ones(3)
A * b

Out[55]:
array([[  1.,   2.,   3.],
[  4.,   5.,   6.],
[  7.,   8.,   9.],
[ 10.,  11.,  12.]])

## ¶

Explanation:

We ended up computing an element-wise product. The vector of ones was replicated once for each row and then used to scale the entire row.

### The correct expression for matrix multiplication¶

In [56]:
A.dot(b)

Out[56]:
array([  6.,  15.,  24.,  33.])

Python 3.5 feature:

In [57]:
A @ b

Out[57]:
array([  6.,  15.,  24.,  33.])

Using the binary infix operator @ is clearer to read.

# [Skip in Class] Solving Linear Systems¶

Suppose you are asked to solve the following system of linear equations:

$$5x - 3y = 2 \\ -9x + 2y = -7$$

this means that we want to solve the following linear systems:

$$\begin{bmatrix} 5 & -3 \\ -9 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2 \\ -7 \end{bmatrix}$$

Solving for $x$ and $y$ we get:

$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 5 & -3 \\ -9 & 2 \end{bmatrix}^{-1} \begin{bmatrix} 2 \\ -7 \end{bmatrix}$$

This can be solved numerically using NumPy:

In [58]:
A = np.array([[5, -3], [-9, 2]])
b = np.array([2,-7])

In [59]:
from numpy.linalg import inv
inv(A) @ b

Out[59]:
array([ 1.,  1.])

Preferred way to solve (more numerically stable)

In [60]:
from numpy.linalg import solve
solve(A, b)

Out[60]:
array([ 1.,  1.])

Two points:

1. Issue with performance
2. Issue with numerical stability

When the matrix is not full rank it may be necessary to use lstsq.

# [Skip In Class] Numpy in Higher Dimensions and why blue is a bad color¶

In homework 1 you had a chance to work with a tensor. Images are tensors:

In [61]:
import scipy.ndimage
image.shape

Out[61]:
(225, 400, 3)
In [62]:
with sns.axes_style("white"):
plt.imshow(image)


### What does the following expression do:¶

image[:, :, np.arange(3) != 1] = 0


Solution

In [63]:
imageC = image.copy()
imageC[:,:,np.arange(3) != 1] = 0

with sns.axes_style("white"):
plt.imshow(imageC)


## Comparing green and blue information¶

Extracting the Green and Blue information

In [64]:
green_data = np.floor(image[:,:,1]).astype(image.dtype)

blue_data = np.floor(image[:,:,2]).astype(image.dtype)

In [65]:
plt.figure(figsize=(20,10))
with sns.axes_style("white"):
plt.subplot(1,2,1)
plt.title("Green")
plt.imshow(green_data)
plt.subplot(1,2,2)
plt.title("Blue")
plt.imshow(blue_data)

**Question:** *Why does Blue look so bad?*

The human eye is not very sensitive to the color blue. As a consequence jpeg compression algorithms tend to more aggressively compress the blue dimension.

# Pandas and the DataFrame¶

In [66]:
import pandas as pd


## Series¶

The numpy arrays we have been using contain data of a single type and the location of the data item matters. Often when working with data the location of the data only matters in a relative sense.

Consider the following city population data:

In [67]:
pop = np.array([10.01, 0.84, 8.41, 13.62, np.nan, 0.84, 0.01, 0.01])
city = ["Seoul", "SF", "NYC", "Tokyo", "Typo", "SF", "Mechville", "Zootopia"]


As long as the population and city stay aligned the exact ordering doesn't matter. Furthermore, the data is really a mapping from city name to population. We would like a data structure that couples these two relates pieces of data.

To address this need Pandas introduces the Series object as a fundamental representation of column of data of the same type.

In [68]:
pop_series = pd.Series(pop, index=city, name="Population")
pop_series

Out[68]:
Seoul        10.01
SF            0.84
NYC           8.41
Tokyo        13.62
Typo           NaN
SF            0.84
Mechville     0.01
Zootopia      0.01
Name: Population, dtype: float64

We can lookup elements by their value

In [69]:
pop_series["Seoul"]

Out[69]:
10.01

Notice that since SF occurs twice we get a series back:

In [70]:
pop_series["SF"]

Out[70]:
SF    0.84
SF    0.84
Name: Population, dtype: float64

### Lookup by location (should be avoided, why?)¶

In [71]:
pop_series.iloc[1:3]

Out[71]:
SF     0.84
NYC    8.41
Name: Population, dtype: float64

### Slicing with predicates¶

In [72]:
pop_series[pop_series > 10.]

Out[72]:
Seoul    10.01
Tokyo    13.62
Name: Population, dtype: float64

### Important Point!¶

Notice that slicing a series preserves the index and the name.

• This is a big improvement over numpy where the meaning of the data is lost when the location changes.

However if you insist you can still access the underlying numpy arrays

In [73]:
pop_series.values

Out[73]:
array([  1.00100000e+01,   8.40000000e-01,   8.41000000e+00,
1.36200000e+01,              nan,   8.40000000e-01,
1.00000000e-02,   1.00000000e-02])
In [74]:
pop_series.index

Out[74]:
Index(['Seoul', 'SF', 'NYC', 'Tokyo', 'Typo', 'SF', 'Mechville', 'Zootopia'], dtype='object')

### We can apply mathematical operations to the series:¶

In [75]:
np.log(pop_series + 1.0)

Out[75]:
Seoul        2.398804
SF           0.609766
NYC          2.241773
Tokyo        2.682390
Typo              NaN
SF           0.609766
Mechville    0.009950
Zootopia     0.009950
Name: Population, dtype: float64

## Cleaning Series¶

In [76]:
pop_series.notnull()

Out[76]:
Seoul         True
SF            True
NYC           True
Tokyo         True
Typo         False
SF            True
Mechville     True
Zootopia      True
Name: Population, dtype: bool
In [77]:
pop_series.dropna()

Out[77]:
Seoul        10.01
SF            0.84
NYC           8.41
Tokyo        13.62
SF            0.84
Mechville     0.01
Zootopia      0.01
Name: Population, dtype: float64
In [78]:
pop_series.drop_duplicates()

Out[78]:
Seoul        10.01
SF            0.84
NYC           8.41
Tokyo        13.62
Typo           NaN
Mechville     0.01
Name: Population, dtype: float64

### Note that drop duplicates is by value not by key¶

In [79]:
pop_series[~pop_series.index.duplicated(keep="first")]

Out[79]:
Seoul        10.01
SF            0.84
NYC           8.41
Tokyo        13.62
Typo           NaN
Mechville     0.01
Zootopia      0.01
Name: Population, dtype: float64

Perhaps a more clear way to removed duplicates using groupby() which we will return to soon

In [80]:
pop_series.groupby(pop_series.index).first()

Out[80]:
Mechville     0.01
NYC           8.41
SF            0.84
Seoul        10.01
Tokyo        13.62
Typo           NaN
Zootopia      0.01
Name: Population, dtype: float64

Cleaning data in one shot:

In [81]:
pop_series_clean = (pop_series
.groupby(pop_series.index).first()
.dropna()
)


### Ordering Data¶

In [82]:
pop_series_clean.sort_values(ascending=False)

Out[82]:
Tokyo        13.62
Seoul        10.01
NYC           8.41
SF            0.84
Zootopia      0.01
Mechville     0.01
Name: Population, dtype: float64
In [83]:
pop_series_clean.sort_index()

Out[83]:
Mechville     0.01
NYC           8.41
SF            0.84
Seoul        10.01
Tokyo        13.62
Zootopia      0.01
Name: Population, dtype: float64

## Summarizing Series¶

Examining a few elements

In [84]:
pop_series_clean.head(n=3)

Out[84]:
Mechville    0.01
NYC          8.41
SF           0.84
Name: Population, dtype: float64

Computing statistics

In [85]:
pop_series_clean.describe()

Out[85]:
count     6.000000
mean      5.483333
std       5.945363
min       0.010000
25%       0.217500
50%       4.625000
75%       9.610000
max      13.620000
Name: Population, dtype: float64

Notice that even the summary is also a Series

## Basic Visualization¶

In [86]:
pop_series_clean.sort_values(ascending=False).plot.bar()
plt.ylabel('Population in Millions')

Out[86]:
<matplotlib.text.Text at 0x1128be1d0>
In [87]:
pop_series_clean.plot.hist(bins=4)

Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x11285cf98>

# Special series types (Strings)¶

In [88]:
text_series = pd.Series(["1,one", "2,two", "3,three", "4,four"],
name="Numbers")
text_series

Out[88]:
0      1,one
1      2,two
2    3,three
3     4,four
Name: Numbers, dtype: object

### Questions?¶

1. What is the index of this series?
2. How could I split the numbers into separate series?

There is a collection of routines associated with each series:

1. series.str.split()
2. series.str.len()
3. series.str.strip()



You will use these often so learn about them.

In [89]:
text_series.str.split(pat=",", expand=True)

Out[89]:
0 1
0 1 one
1 2 two
2 3 three
3 4 four

# Special Series Types (Dates and Times)¶

Consider the following string data for the three lectures Prof. Gonzalez taught (so far...)

In [90]:
date_strings = pd.Series(["01/17/2017", "01/19/2017", "02/07/2017"],
index=["lec1", "lec2", "lec7"], name="Dates")
date_strings

Out[90]:
lec1    01/17/2017
lec2    01/19/2017
lec7    02/07/2017
Name: Dates, dtype: object

We can use the Pandas built-in date-time parsing facilities pd.to_datetime to convert the strings into date objects:

In [91]:
dates = pd.to_datetime(date_strings)
dates

Out[91]:
lec1   2017-01-17
lec2   2017-01-19
lec7   2017-02-07
Name: Dates, dtype: datetime64[ns]

Then use one of the many date time series operations defined here.

In [92]:
dates.dt.dayofweek

Out[92]:
lec1    1
lec2    3
lec7    1
Name: Dates, dtype: int64

Many of the visualizations are pretty basic but can be useful as you are quickly exploring data.

# Data Frames¶

DataFrames are a programming abstraction for Tables that provide a lot of the syntactic functionality we found useful when working with matrices.

Conceptually a DataFrame is a collection of series (columns) with a common index. Let's work through some basic examples.

### Making a DataFrame¶

In [93]:
import pandas as pd


We can make a DataFrame from a dictionary:

In [94]:
baby_names_dictionary = {
"Name": ["Emma", "Liam", "Noah", "Olivia",  "Sophia"],
"Sex": ["F", "M", "M", "F", "F"],
"Count": [20355, 18281, 19511, 19553, 17327]
}
baby_names_dictionary

Out[94]:
{'Count': [20355, 18281, 19511, 19553, 17327],
'Name': ['Emma', 'Liam', 'Noah', 'Olivia', 'Sophia'],
'Sex': ['F', 'M', 'M', 'F', 'F']}
In [95]:
baby_names = pd.DataFrame(baby_names_dictionary)
baby_names

Out[95]:
Count Name Sex
0 20355 Emma F
1 18281 Liam M
2 19511 Noah M
3 19553 Olivia F
4 17327 Sophia F
Does the order of columns matter? Conceptually no. However we have to be a little careful since Pandas allows us to index columns by their location (avoid doing this).
Does the order of row matter? Conceptually no. However we have to be a little careful since Pandas allows us to index rows by their location (avoid doing this).

## Accessing each column (Series)¶

In [96]:
baby_names['Name']

Out[96]:
0      Emma
1      Liam
2      Noah
3    Olivia
4    Sophia
Name: Name, dtype: object
In [97]:
baby_names[['Name', 'Sex']]

Out[97]:
Name Sex
0 Emma F
1 Liam M
2 Noah M
3 Olivia F
4 Sophia F

## Default Index?¶

What is the default index?

In [98]:
baby_names.index

Out[98]:
RangeIndex(start=0, stop=5, step=1)

Since we never specified the index, a default index was created which numbers each of the rows. We can access each row using this default index:

In [99]:
baby_names.loc[[1,3]]

Out[99]:
Count Name Sex
1 18281 Liam M
3 19553 Olivia F

## Setting an Index¶

In [100]:
baby_names = baby_names.set_index(['Name', 'Sex'])
baby_names

Out[100]:
Count
Name Sex
Emma F 20355
Liam M 18281
Noah M 19511
Olivia F 19553
Sophia F 17327
In [101]:
baby_names.loc[[("Liam", "M"), ("Noah", "M")]]

Out[101]:
Count
Name Sex
Liam M 18281
Noah M 19511

%%bash
wget https://www.ssa.gov/oact/babynames/state/namesbystate.zip
unzip namesbystate.zip

In [102]:
!ls *.TXT

AK.TXT CO.TXT GA.TXT IN.TXT MD.TXT MS.TXT NH.TXT OH.TXT SC.TXT VA.TXT WY.TXT
AL.TXT CT.TXT HI.TXT KS.TXT ME.TXT MT.TXT NJ.TXT OK.TXT SD.TXT VT.TXT
AR.TXT DC.TXT IA.TXT KY.TXT MI.TXT NC.TXT NM.TXT OR.TXT TN.TXT WA.TXT
AZ.TXT DE.TXT ID.TXT LA.TXT MN.TXT ND.TXT NV.TXT PA.TXT TX.TXT WI.TXT
CA.TXT FL.TXT IL.TXT MA.TXT MO.TXT NE.TXT NY.TXT RI.TXT UT.TXT WV.TXT

In [103]:
!head CA.TXT




**Data cleaning question:** _What is the format of this file?_ Comma separated values (CSV)
In [104]:
baby_names = pd.read_csv("CA.TXT")

Out[104]:
CA F 1910 Mary 295
0 CA F 1910 Helen 239
1 CA F 1910 Dorothy 220
2 CA F 1910 Margaret 163
3 CA F 1910 Frances 134
4 CA F 1910 Ruth 128
What went wrong? No header provided in the file so first record was treated as a header.

In [105]:
column_names = ["State", "Sex", "Year", "Name", "Count"]
baby_names = pd.read_csv("CA.TXT", names = column_names)

Out[105]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Looks Good!

For fun let's load the rest of the states (why might we not want to do this?):

In [106]:
import os

file_names = (f for f in os.listdir() if f.endswith(".TXT"))

baby_names = pd.concat(
(pd.read_csv(f, names = column_names) for f in file_names)
).reset_index(drop=True)

In [107]:
baby_names.head()

Out[107]:
State Sex Year Name Count
0 AK F 1910 Mary 14
1 AK F 1910 Annie 12
2 AK F 1910 Anna 10
3 AK F 1910 Margaret 8
4 AK F 1910 Helen 7

How much data do we have?

In [108]:
len(baby_names)

Out[108]:
5743017

How many total people are counted?

In [109]:
baby_names['Count'].sum()

Out[109]:
302037244
**Question:** *Is this number reasonable?* It seems low. However this is what the social security website states: All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.

To get a better understanding of how people enrolled in social security you can plot the number of applicants per year.

 baby_names.groupby("Year")['Count'].sum().plot() 

We will explain the following query in a few minutes but let's quickly take a look at the date distribution of the data.

In [110]:
baby_names.groupby("Year")['Count'].sum().plot()
plt.ylabel('Count')

Out[110]:
<matplotlib.text.Text at 0x1204e4160>

You can add derived columns by simply assigning to the DataFrame

In [111]:
baby_names['Len'] = baby_names['Name'].str.len()

Out[111]:
State Sex Year Name Count Len
0 AK F 1910 Mary 14 4
1 AK F 1910 Annie 12 5
2 AK F 1910 Anna 10 4
3 AK F 1910 Margaret 8 8
4 AK F 1910 Helen 7 5

Notice the .str.len() above? There is a large collection of string operations you can apply to series containing strings.

### Computing the last letter¶

In [112]:
baby_names['Last Letter'] = baby_names['Name'].str[-1].str.lower()

Out[112]:
State Sex Year Name Count Len Last Letter
0 AK F 1910 Mary 14 4 y
1 AK F 1910 Annie 12 5 e
2 AK F 1910 Anna 10 4 a
3 AK F 1910 Margaret 8 8 t
4 AK F 1910 Helen 7 5 n

### Zoom in on the data for California in 2015.¶

In [113]:
ca2015 = baby_names[
(baby_names['Year'] == 2015) &
(baby_names['State'] == "CA")
]

len(ca2015)

Out[113]:
6836
In [114]:
ca2015.head()

Out[114]:
State Sex Year Name Count Len Last Letter
571368 CA F 2015 Sophia 2942 6 a
571369 CA F 2015 Mia 2850 3 a
571370 CA F 2015 Emma 2706 4 a
571371 CA F 2015 Olivia 2507 6 a
571372 CA F 2015 Isabella 2388 8 a
• Notice that we retained the index from the original table.
In [115]:
ca2015[ca2015["Name"] == "Joey"]

Out[115]:
State Sex Year Name Count Len Last Letter
572582 CA F 2015 Joey 24 4 y
720491 CA M 2015 Joey 80 4 y

# Grouping¶

We will often want to compute data at a more coarse granularity by aggregating over all records that share a common set of attributes. This process is called grouping. In the following we will explore a few grouping operations. To get an intuition behind grouping consider the following figure:

## Compute the total number of males and females¶

1. How should I split the data?
2. Aggregation function?
3. What will my final table look like?

In [116]:
gender_counts = baby_names.groupby("Sex")[["Count"]].sum()

Out[116]:
Count
Sex
F 145235866
M 156801378

Notice that the groupby operations produces a table that is indexed (keyed) by the group attribute (in this case Sex).

### What are the most common names of all time?¶

1. How should I split the data?
2. How should I aggregate each group?
3. What will my table look like?

In [117]:
name_counts = baby_names.groupby('Name')[['Count']].sum()

Out[117]:
Count
Name
Aaban 12

Visualizing the most popular names

In [118]:
(
name_counts['Count']
.sort_values(ascending=False)
.sort_values()
.plot.barh()
)

Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x112d99eb8>
1. How should I split the data?
2. How should I aggregate each group?
3. What will my table look like?

In [119]:
name_counts = baby_names.groupby(['Sex','Name'])['Count'].sum()

plt.subplot(2,1,1)
(name_counts
.loc['F']
.sort_values(ascending=False)
.sort_values()
.plot.barh()
)
plt.ylabel('Female Names')
plt.xlim(0, name_counts.max())
plt.subplot(2,1,2)
(name_counts
.loc['M']
.sort_values(ascending=False)
.sort_values()
.plot.barh()
)
plt.ylabel('Male Names')
plt.xlim(0, name_counts.max())

Out[119]:
(0, 4954037)
**Question:** What is going on here?

Males seem to be concentrated around Judeo-Christian names while females appear to be more diverse.

## How many uniques are there for men and women?¶

1. How should I split the data?
2. How should I aggregate each group?
3. What will my table look like?

In [120]:
unique_names = baby_names.groupby('Sex')['Name'].nunique()

In [121]:
unique_names.plot.bar()

Out[121]:
<matplotlib.axes._subplots.AxesSubplot at 0x1129eeb70>

## What names are most gender neutral?¶

1. What does it mean to be commonly associated with both sexes?
2. What computation would I run to compute this value?

#### Proposals:¶

1. For each name compute $|C_M - C_F|$
2. For each name compute $(C_M + C_F)/2$
3. For each name compute $\sqrt{C_M * C_F}$

I will focus on more recent dates:

In [122]:
(baby_names[baby_names['Year'] > 2000]
.groupby(['Name', 'Sex'])['Count'].sum()
.groupby(level ='Name').prod()
.sort_values(ascending=False)
.sort_values()
.plot.barh()
)

Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x1187d71d0>

Notice in the above example it was necessary to use the level argument for the second groupby call. This is because the Name column became an index after the first groupby. If we wanted to avoid this we could do the following:

(baby_names[baby_names['Year'] > 2000]
.groupby(['Name', 'Sex'], as_index=False)['Count'].sum()
.groupby('Name').prod()
.sort_values('Count', ascending=False)
.sort_values('Count')
.plot.barh()
)

In [123]:
(baby_names.groupby("Name")['Count'].sum()
.loc[["Joey", "Joseph", "Bin", "Deborah"]]
.plot.bar()
)

Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x11258fdd8>

## Pivoting¶

Supposed we wanted to study the breakdown with respect to the last letter in the names. We might like a table that looks like:

"M" "F"
"a" 48576568 1560980
"b" 9286 1343336
"c" 17077 1545079

We can build such a table by Pivoting. Pivoting will take the unique values in a column (e.g., Sex: {M, F}) and make those into separate columns. Then we can choose a column (or set of columns) to groupby (e.g., Last Letter : {a, b, ...}) and finally a column for which we want to compute the total (e.g., Count). In essence pivoting is just like groupby except we can choose two dimension along which to group.

In [124]:
last_letter_pivot = baby_names.pivot_table(
index=['Last Letter'], # the row index
columns=['Sex'], # the column values
values='Count', # the field(s) to processed in each group
aggfunc='sum', # group operation
margins=True # show margins All
)
last_letter_pivot

Out[124]:
Sex F M All
Last Letter
a 48576568.0 1560980.0 50137548.0
b 9286.0 1343336.0 1352622.0
c 17077.0 1545079.0 1562156.0
d 562936.0 15387928.0 15950864.0
e 30938194.0 12672282.0 43610476.0
f 452.0 153419.0 153871.0
g 12683.0 516521.0 529204.0
h 6941294.0 6362319.0 13303613.0
i 2931172.0 568135.0 3499307.0
j 896.0 11926.0 12822.0
k 14277.0 4978400.0 4992677.0
l 4921150.0 13754881.0 18676031.0
m 358415.0 5315522.0 5673937.0
n 17642883.0 33756408.0 51399291.0
o 306862.0 3468781.0 3775643.0
p 597.0 645563.0 646160.0
q 85.0 5381.0 5466.0
r 3620215.0 9192751.0 12812966.0
s 3187662.0 15895496.0 19083158.0
t 2167707.0 8811677.0 10979384.0
u 46316.0 46682.0 92998.0
v 2257.0 25993.0 28250.0
w 35666.0 2985708.0 3021374.0
x 16870.0 568499.0 585369.0
y 22820581.0 17172204.0 39992785.0
z 103765.0 55507.0 159272.0
All 145235866.0 156801378.0 302037244.0

We can use the built-in plotting functionality in Pandas to visualize this data quickly:

In [125]:
(
last_letter_pivot
.plot.bar()
)
plt.ylabel("Count")

Out[125]:
<matplotlib.text.Text at 0x11423d2e8>

## This plot is flawed. Why?¶

• We don't want the marginals in the plot
• We are looking at counts and we might be interested in proportions?
• We sorted by letter making it hard to compare various lines?
In [126]:
# normalize the counts
normalized_ll_pivot = (
last_letter_pivot.div(last_letter_pivot.sum(axis=1), axis=0)
)

pink_blue = ["#E188DB", "#334FFF"]
with sns.color_palette(sns.color_palette(pink_blue)):
(normalized_ll_pivot
.sort_values("F",ascending=False) # Sort the plot
.plot.bar()
)
plt.ylabel('Proportion')

**Question:** *How might you use this information to predict Sex from name?* There are certain letters that appear to be disproportionately associated with the Sex of the baby.

## More advanced Pivoting and Grouping¶

Suppose I wanted to compute a pivot table which looked like:

Name  Andrew     Sam
Year
1910   845.0   847.0
1911  1066.0   895.0
1912  1922.0  1295.0
1913  2204.0  1518.0
1914  2957.0  1856.0
...

containing the total number of babies each year with a given name.

Let's fill in the following query

(
baby_names
.pivot_table(
index = ???,
columns = ???,
values = ???,
aggfunc = ???)
)


Suppose we want to track the popularity of the Staff names over time?

In [127]:
staff = ["Andrew", "Sam", "Bin", "Deborah", "Joseph", "Joey"]

In [128]:
(baby_names
.pivot_table(index=['Year'], columns=['Name'], values='Count', aggfunc='sum')
)

Out[128]:
Name Andrew Sam Bin Deborah Joseph Joey
Year
1910 845.0 847.0 NaN 11.0 5226.0 NaN
1911 1066.0 895.0 NaN 12.0 6492.0 NaN
1912 1922.0 1295.0 NaN NaN 12070.0 NaN
1913 2204.0 1518.0 NaN 6.0 14467.0 NaN
1914 2957.0 1856.0 NaN 34.0 18858.0 NaN

Some names don't occur on some years. What should I do about the NaN values?

In [129]:
staff_pivot = (
baby_names
.pivot_table(
index=['Year'], columns=['Name'], values='Count', aggfunc='sum')
.loc[:,staff]
## Replace the NaN Values with 0.0
.fillna(0.0)
)

Out[129]:
Name Andrew Sam Bin Deborah Joseph Joey
Year
1910 845.0 847.0 0.0 11.0 5226.0 0.0
1911 1066.0 895.0 0.0 12.0 6492.0 0.0
1912 1922.0 1295.0 0.0 0.0 12070.0 0.0
1913 2204.0 1518.0 0.0 6.0 14467.0 0.0
1914 2957.0 1856.0 0.0 34.0 18858.0 0.0
In [130]:
staff_pivot.plot(marker='.')
plt.ylabel("Count")

Out[130]:
<matplotlib.text.Text at 0x11830fc18>