Analyzing tabular data with pandas

featuring the human proteome

pandas

Pandas is the de facto standard for reading, analyzing and visualizing tabular data from CSV files, Excel tables and many other sources.

The main data structure for tables is called DataFrame:

pip install pandas
pip install matplotlib

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('human_proteome.csv', index_col=0)
df.head()
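
If the CSV file is not at hand, the same operations can be tried on a small hand-made DataFrame. The sketch below assumes a layout like the one the examples here rely on (a name column, a length column and one count column per amino acid such as W or H). The row labels, protein names and numbers are invented for illustration and are not taken from human_proteome.csv.

```python
import pandas as pd

# Hypothetical toy data: labels, names and counts are invented for illustration
toy = pd.DataFrame(
    {
        "name": ["Protein A", "Protein B", "Protein C"],
        "length": [120, 450, 80],
        "W": [2, 9, 0],
        "H": [6, 10, 4],
    },
    index=["ID1", "ID2", "ID3"],  # plays the role of the column chosen by index_col=0
)

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # data type of each column
print(toy.head(2))  # first two rows
```

Every column of a DataFrame is a Series, and all columns share the same row index.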

df['name']
df[['name', 'length']]

df.sum(numeric_only=True)  # column totals; skips text columns such as 'name'

df['W'].mean()
df['W'].std()
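
One detail worth knowing: std() in pandas returns the sample standard deviation (it divides by n - 1) unless told otherwise. A minimal sketch with invented counts, not data from the proteome file:

```python
import pandas as pd

# Hypothetical counts, invented for illustration
w = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = w.mean()
sample_std = w.std()            # default ddof=1: divides by n - 1
population_std = w.std(ddof=0)  # divides by n

print(mean, sample_std, population_std)
```

For a table with thousands of rows the two variants are nearly identical, but on small selections the difference is visible.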

df.describe()

df[df['length'] > 5000]
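
Several conditions can be combined in one filter with & (and), | (or) and ~ (not); each condition needs its own parentheses. A minimal sketch on an invented toy table (the values are hypothetical, not from the proteome file):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "length": [120, 6000, 80, 7500],
        "W": [2, 90, 0, 40],
    }
)

# long proteins that also contain many tryptophans
long_and_w = toy[(toy["length"] > 5000) & (toy["W"] > 50)]

# everything that is NOT long
not_long = toy[~(toy["length"] > 5000)]

print(long_and_w)
print(not_long)
```

Python's own and / or keywords do not work here, because the comparison yields a whole column of True/False values rather than a single one.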

df.sort_values('length', ascending=False).head(10)

df['H_percent'] = df['H'] / df['length'] * 100  # histidine content in percent
df.sort_values('H_percent', ascending=False).head(20)

def match_name(name, query):
    return query in name

hemo = df['name'].apply(match_name, args=("Hemoglobin",))  # args must be a tuple
df[hemo]
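
For plain substring matches, pandas also offers vectorized string methods, which avoid writing a helper function. The sketch below uses Series.str.contains on a few example strings that stand in for the name column; regex=False treats the query as plain text, and na=False counts missing names as non-matches.

```python
import pandas as pd

# Example strings standing in for the 'name' column (not read from the CSV)
names = pd.Series(
    ["Hemoglobin subunit alpha", "Titin", None, "Hemoglobin subunit beta"]
)

# True/False per row; missing values become False
mask = names.str.contains("Hemoglobin", regex=False, na=False)
matches = names[mask]

print(matches)
```

The resulting mask can be used to index the DataFrame in the same way as the result of apply.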

df.plot.scatter('K', 'R')
plt.axis([0, 1000, 0, 500])
plt.savefig('scatterplot.png')

plt.figure()
short = df[df['length'] < 1000]
short['length'].hist(bins=50)
plt.savefig('lengths.png')