Sep 3, 2021 - Continuous Variables

Histogram

plt.figure(figsize=(12,6))
sns.histplot(df['SalePrice'])

Line Plot

plt.figure(figsize=(15,8))
sns.lineplot(data=df[:100]['SalePrice'])

Violin Plot

plt.figure(figsize=(12,6))
sns.violinplot(x = df['SaleCondition'], y = df['SalePrice'])
plt.axhline(df[df['SaleCondition'] == 'Normal']['SalePrice'].mean(),\
            color='r',linestyle='dashed',label='normal_avg')
plt.legend()

Box Plot

These plots are useful for outlier detection.
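
As a rough sketch of why box plots help with outliers: with seaborn's default `whis=1.5`, points drawn beyond the whiskers are those more than 1.5 times the interquartile range (IQR) away from the quartiles. The helper name `iqr_outliers` and the sample prices below are hypothetical, chosen only for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points,
    # treating the sample min and max as the 0th and 100th percentiles.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sale prices (in thousands), not taken from the dataset.
prices = [120, 135, 140, 150, 155, 160, 175, 180, 200, 755]
print(iqr_outliers(prices))
```

Only the value far above the upper fence is flagged here; the same rule applies per category in the grouped box plots.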

Horizontal

plt.figure(figsize=(12,6))
sns.boxplot(data=df, y='SaleCondition', x='SalePrice', orient='h')


Vertical

plt.figure(figsize=(12,6))
sns.boxplot(data=df, x='SaleCondition', y='SalePrice', orient='v')

Ridge Line Plot

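ridge_plot = sns.FacetGrid(df, row="SaleCondition", hue="SaleCondition", aspect=5, height=1.25)
# One filled density curve per SaleCondition (fill= replaces the deprecated shade=)
ridge_plot.map(sns.kdeplot, 'SalePrice', fill=True)
# Baseline at y=0 for each row
ridge_plot.map(plt.axhline, y=0)
ridge_plot.fig.subplots_adjust(hspace=0.35)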

QQ Plots

Source: https://seaborn-qqplot.readthedocs.io/en/latest/

from seaborn_qqplot import pplot

pplot(df.iloc[:250,:], x='YearBuilt', y='SalePrice', kind='qq', height=4, aspect=2)


Aug 3, 2021 - Categorical Variables - Barcharts

Categorical Variables - Barcharts

sns.set_style('darkgrid')

Faceted Bar Chart

seaborn

g = sns.catplot(x="sex", y="count",
                hue="survived", col="pclass",
                data=df, kind="bar",
                height=6, aspect=.7, palette="flare");


Basic Bar Chart

df_copy2  = df['sex'].value_counts().reset_index()
df_copy2.columns = ['gender', 'count']
df = df_copy2

seaborn

plt.figure(figsize=(8,4))
plt.title('Titanic Gender Distribution')

sns.barplot(x='gender', y='count', data=df, palette='pastel', alpha=0.9)


matplotlib

plt.title('Titanic Gender Distribution')
plt.bar(x=df['gender'], height=df['count'], color=['blue', 'red'], alpha=0.4, width=0.4)
plt.xlabel('Gender')
plt.ylabel('Count')


Horizontal Bar charts

seaborn

# Flip the x and y variables
plt.figure(figsize=(8,4))
plt.title('Titanic Gender Distribution')
sns.barplot(x='count', y='gender', data=df, palette='pastel', alpha=0.5)


matplotlib

# barh takes y (the categories) and width (the bar lengths)
plt.title('Titanic Gender Distribution')
plt.barh(y=df['gender'], width=df['count'], color=['blue', 'red'], alpha=0.4)
plt.xlabel('Count')
plt.ylabel('Gender')


Reordering the bars

seaborn

# notice the order parameter
plt.figure(figsize=(8,4))
plt.title('Titanic Gender Distribution')

sns.barplot(x='gender', y='count', data=df, palette='pastel', alpha=0.9, order=['male', 'female'])


matplotlib

matplotlib has no order parameter: sort the dataframe into the desired order first and then plot it, since plt.bar draws the bars in row order.
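
A minimal sketch of that approach, assuming a small frame with the same 'gender' and 'count' columns as above. The counts and the names `counts` and `ordered` are hypothetical stand-ins, not values taken from the post.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts standing in for the 'gender'/'count' frame built above.
counts = pd.DataFrame({"gender": ["female", "male"], "count": [466, 843]})

# Sort the rows first; the bars are then drawn in this row order.
ordered = counts.sort_values("count", ascending=False).reset_index(drop=True)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x=ordered["gender"], height=ordered["count"],
       color=["blue", "red"], alpha=0.4, width=0.4)
ax.set_title("Titanic Gender Distribution")
ax.set_xlabel("Gender")
ax.set_ylabel("Count")
```

For a custom, non-sorted order, reindexing the frame by the desired category list before plotting should work the same way.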

Jul 2, 2021 - Categorical Variables - Cleveland Dot Plots

Cleveland Dot Plots

Cleveland Dot Plot

matplotlib

# my_range is not defined in the original snippet;
# assumed here to be one y position per row
my_range = range(1, len(df.index) + 1)

plt.figure(figsize=(20,10))
plt.hlines(y=my_range, xmin=0, xmax=df['fare'], color='skyblue')
plt.grid(True)
plt.plot(df['fare'], my_range, "o")
plt.yticks(my_range, df['name'])
plt.title("Ticket Price Dot Plot", loc='left')
plt.xlabel('Ticket Price')
plt.ylabel('Name')


Multiple Dots

matplotlib

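Sorting has to be done on the dataframe itself before plotting.

# ordered_df and my_range are not defined in the original snippet;
# assumed here: rows sorted by fare, one y position per row
ordered_df = df.sort_values(by='fare')
my_range = range(1, len(ordered_df.index) + 1)

plt.figure(figsize=(20,10))
plt.hlines(y=my_range, xmin=0, xmax=ordered_df['fare'], color='skyblue')
plt.hlines(y=my_range, xmin=0, xmax=ordered_df['age'], color='red')

plt.grid(True)
plt.plot(ordered_df['fare'], my_range, "o", label='fare')
plt.plot(ordered_df['age'], my_range, "o", label='age')

plt.yticks(my_range, ordered_df['name'])
plt.title("Ticket Price and Age Dot Plot", loc='left')
plt.xlabel('Ticket Price / Age')
plt.ylabel('Name')
plt.legend()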


Jun 1, 2021 - Multivariate Continuous

Content

About the dataset
What is “Multivariate”
Scatter Plot
Simple Scatter Plot
- Scatter Plot with regression line
- Faceted groups
Basic Pair Plot
Category Wise Pair Plot
Contour lines
Correlations
Diverging Palette Red
Diverging Palette Blue with upper triangle Mask
Plasma Palette with upper triangle Mask
Joint Plots
Parallel Plots
References

About the dataset

Context

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Acknowledgements

Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.

Inspiration

Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?

What is “Multivariate”

Multivariate data analysis is a set of statistical models that examine patterns in multidimensional data by considering, at once, several data variables. It is an expansion of bivariate data analysis, which considers only two variables in its models. As multivariate models consider more variables, they can examine more complex phenomena and find data patterns that more accurately represent the real world.

Scatter Plot

Basic ScatterPlot

The scatter plot gives only a rough idea of the relationship between the variables; a correlation matrix is needed to quantify it.

plt.figure(figsize=(12,6), dpi=140)
num_col1 = 'BMI'
num_col2 = 'BloodPressure'
target= 'Outcome'
cat_num_col1='Pregnancies'
cat_num_col2 ='Age'

sns.scatterplot(x=num_col1, y=num_col2, data=df,
                style=target, hue=cat_num_col1, 
                size=cat_num_col2, alpha=0.7, palette = 'plasma',
)#,sizes=(20,100), hue_norm=(0,15))


With regression Line

The regression lines summarise the trend within each class, but the plot still gives only a rough idea of the relationship; a correlation matrix is needed to quantify it.

num_col1 = 'BMI'
num_col2 = 'BloodPressure'
target= 'Outcome'
cat_num_col1='Pregnancies'
cat_num_col2 ='Age'

# lmplot is a figure-level function that creates its own figure,
# so it is sized with height/aspect instead of plt.figure()
sns.lmplot(x=num_col1, y=num_col2, markers=['o','x'], hue=target, data=df, fit_reg=True, height=6, aspect=2)


Faceted groups

num_col1 = 'BMI'
num_col2 = 'BloodPressure'
target= 'Outcome'
cat_num_col1='Pregnancies'
cat_num_col2 ='Age'

sns.relplot(
    data=df, x=num_col1, y=num_col2,
    col=target, hue=cat_num_col1, size=cat_num_col2, style = target,palette = 'plasma',
    kind="scatter"#,aspect=0.5, height=12
)


Basic Pair Plot

Check the distributions on the diagonal of the pair plot for skewness and outliers.
The scatter plots give only a rough idea of the relationships between variables; a correlation matrix is needed to quantify them.

# pairplot creates its own figure, so a preceding plt.figure() is not needed
sns.pairplot(df)
plt.show()


Category Wise Pair Plot

Check the scatter plots for linear separability of the classes to decide between a linear and a non-linear model.
The density curves on the diagonal indicate how close each variable is to normal. In this example several are skewed, possibly because of outliers (try removing them and re-plotting).

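# pairplot creates its own figure and legend; relabel the classes before plotting
# (assumes Outcome is coded 0 = non-diabetic, 1 = diabetic)
labelled = df.assign(Outcome=df['Outcome'].map({0: 'Non Diabetic', 1: 'Diabetic'}))
sns.pairplot(labelled, hue='Outcome', palette='plasma')
plt.show()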


Contour lines Plot

sns.kdeplot(data=df, x=num_col1, y=num_col2, hue=target,fill=True,alpha=0.5,palette = 'plasma')


Correlations

All pairwise correlations are around or below 0.5, so none of the linear relationships is very strong.
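
Each cell in the heatmaps below is a Pearson correlation coefficient, the default for `df.corr()`. As a rough illustration of what that number measures, here is a plain-Python sketch; the function name `pearson_r` and the sample values are hypothetical and not taken from the diabetes dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements, used only to exercise the function.
glucose = [85, 90, 110, 130, 150]
insulin = [60, 95, 80, 170, 120]
r = pearson_r(glucose, insulin)
print(round(r, 3))
```

The result always lies between -1 and 1; values near 0 indicate little linear relationship, which is how the heatmap annotations should be read.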

Diverging Palette Red

plt.figure(figsize= (14,8))
# cmap=sns.diverging_palette(5, 250, as_cmap=True)
cmap = sns.diverging_palette(250, 10, as_cmap=True)
ax = sns.heatmap(df.corr(),center = 0,annot= True,linewidth=0.5,cmap= cmap)


Diverging Palette Blue with upper triangle Mask

corr = df.corr()
plt.figure(figsize=(14,8))
cmap=sns.diverging_palette(5, 250, as_cmap=True)
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True,cmap=cmap,center = 0,annot=True)


Plasma Palette with upper triangle mask

plt.figure(dpi = 80,figsize= (14,8))
mask = np.triu(np.ones_like(df.corr(),dtype = bool))
sns.heatmap(df.corr(),mask = mask, fmt = ".2f",annot=True,lw=1,cmap = 'plasma')
plt.yticks(rotation = 0)
plt.xticks(rotation = 90)
plt.title('Correlation Heatmap')
plt.show()


Joint Plots

Dig deeper into each variable and its association with the other variables.
In this example:
- Glucose shows a weak positive linear association with the other variables in the dataset.
  That is, as a patient's glucose level increases, the other variables tend to increase as well. Weak linear association between predictors is helpful because it limits multicollinearity in predictive modelling.

comparison_variable = 'Glucose'
target = 'Outcome'
corr = df.corr()  # compute the correlation matrix once, outside the loop

print("Joint plot of {} with Other Variables ==> \n".format(comparison_variable))
for i in df.columns:
    if i != comparison_variable and i != target:
        print(f"Correlation between {comparison_variable} and {i} ==> ", corr.loc[comparison_variable, i])
        # jointplot creates its own figure, so no plt.figure() call is needed
        sns.jointplot(x=comparison_variable, y=i, data=df, kind='reg', color='purple')
        plt.show()

Joint plot of Glucose with Other Variables ==> 

Correlation between Glucose and Pregnancies ==>  0.129458671499273
Correlation between Glucose and BloodPressure ==>  0.15258958656866448
Correlation between Glucose and SkinThickness ==>  0.057327890738176825
Correlation between Glucose and Insulin ==>  0.3313571099202081
Correlation between Glucose and BMI ==>  0.22107106945898305
Correlation between Glucose and DiabetesPedigreeFunction ==>  0.1373372998283708
Correlation between Glucose and Age ==>  0.26351431982433376

Parallel Plots

from pandas.plotting import parallel_coordinates

numeric_cols = ['BloodPressure','Pregnancies','BMI','SkinThickness','Glucose',target]

tdf = df.sample(100)

parallel_coordinates(tdf[numeric_cols], target, color = ['r','b'])


References

https://www.kaggle.com/ravichaubey1506/multivariate-statistical-analysis-on-diabetes/notebook
https://seaborn.pydata.org/generated/seaborn.scatterplot.html
https://www.kaggle.com/princeashburton/multivariate-plotting

May 2, 2021 - Categorical Variables - Heatmaps

Binplots

Hexagonal Binplots

matplotlib

fig, axs = plt.subplots(ncols=1, sharey=True, figsize=(10, 6))
fig.subplots_adjust(hspace=0.5, left=0.07, right=0.93)
ax = axs
hb = ax.hexbin(df["age"], df["fare"], gridsize=5, cmap='Blues', alpha = 0.9)
# pandas .min()/.max() skip missing values, unlike the built-in min()/max()
ax.axis([df['age'].min(), df['age'].max(), df['fare'].min(), df['fare'].max()])
ax.set_title("Hexagon binning")
cb = fig.colorbar(hb, ax=ax)
cb.set_label('counts')
ax.set_xlabel('Age')
ax.set_ylabel('Fare')


seaborn

# jointplot creates its own figure; the fig and hb objects from the
# matplotlib example belong to a different figure and are not reused here
g = sns.jointplot(data=df, x="age", y="fare", kind="hist")
g.ax_joint.set_title("Square Binplots with distributions", pad=70.0)
plt.show()


g = sns.jointplot(data=df, x="age", y="fare", kind="hex")
# Limit the joint axes to the observed range of each variable
g.ax_joint.axis([df['age'].min(), df['age'].max(), df['fare'].min(), df['fare'].max()])


Apr 2, 2021 - Categorical Variables - Mosaic Plots

Mosaic Plots

from statsmodels.graphics.mosaicplot import mosaic

#df = pd.DataFrame({'size' : ['small', 'large', 'large', 'small', 'large', 'small', 'large', 'large'], 'length' : ['long', 'short', 'medium', 'medium', 'medium', 'short', 'long', 'medium'], 'temp' : ['cold', 'hot', 'cold', 'warm', 'warm', 'cold', 'hot', 'warm']})

props = {}
single_low = 28
max_start = 255
max_start_oth = 229
diff = 25
r,g,b=max_start,max_start_oth,max_start_oth
for x in df['sex'].unique(): #unique colums in each
    for y in df['pclass'].unique():
        col = '#{}{}{}'.format(format(int(r),'02x'),format(int(g),'02x'),format(int(b),'02x'))
        for z in df['survived'].unique():
            props[(str(z), str(y), str(x))] ={'color': col}
            if r==max_start:
                g-=diff
                b-=diff
            elif b==max_start:
                r-=diff
                g-=diff
            elif g==max_start:
                r-=diff
                b-=diff
            if (g<single_low and b<single_low):
                b,r,g=max_start,max_start_oth,max_start_oth
            elif (r<single_low and g<single_low):
                g,b,r = max_start,max_start_oth,max_start_oth
            elif (r<single_low and b<single_low):
                print("no more colors")

import matplotlib as mpl
mpl.rc("figure", figsize=(14,5))
mosaic(df, ['survived', 'pclass', 'sex'], properties=props, title='Survival of Passengers on Titanic - Mosaic ')
plt.show()


Mar 2, 2021 - Categorical Variables - Others

Other Categorical Plots

Correlation Plots

plt.figure(figsize= (14,8))
# cmap = sns.diverging_palette(5, 250, as_cmap=True)
cmap = sns.diverging_palette(250, 10, as_cmap=True)
# numeric_only=True (pandas >= 1.5) skips non-numeric columns; pandas 2.x raises on them otherwise
ax = sns.heatmap(df.corr(numeric_only=True), center=0, annot=True, linewidth=0.5, cmap=cmap)
plt.title('Heatmap for categorical variables in Titanic Dataset', size=15)


Symmetric Matrix - hence only showing the lower half

# numeric_only=True (pandas >= 1.5) skips non-numeric columns
corr = df.corr(numeric_only=True)
plt.figure(figsize=(14,8))
# Boolean mask that hides the upper triangle, including the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, cmap=cmap, center=0)
plt.title('Heatmap for categorical variables in Titanic Dataset - Partial Matrix', size=15)


Barplots for each Categorical Column

survived_columns = ['pclass', 'survived', 'sibsp', 'parch']
for col in survived_columns:
    val = df[col].value_counts(dropna=False)
    if(len(val.index)>100):
        print("Too many Categories in "+col)
        continue
    sns.barplot(x=val.index, y=val.values, alpha=0.8)
    plt.title(col)
    plt.ylabel('Count')
    plt.xlabel("Classes")
    plt.grid(True)
    plt.show()


Circle Charts - Boolean Columns

survivors = df[df["survived"]==1]
non_survivors = df[df["survived"]==0]

d_cols = ['pclass', 'survived', 'sibsp', 'parch']

# One donut chart per column for passengers who survived
fig = plt.figure(figsize=(16,4))
for j, col in enumerate(d_cols):
    plt.subplot(1, len(d_cols), j+1)
    survivors[col].value_counts().plot.pie(autopct = "%1.0f%%",colors = sns.color_palette("prism"),startangle = 90,
                                           wedgeprops={"linewidth":1,"edgecolor":"white"},shadow =True)
    # A white circle in the centre turns the pie into a donut
    circ = plt.Circle((0,0),.7,color="white")
    plt.gca().add_artist(circ)
    plt.ylabel("")
    plt.title(col+"-Survivor")


# The same charts for passengers who did not survive
fig = plt.figure(figsize=(16,4))
for j, col in enumerate(d_cols):
    plt.subplot(1, len(d_cols), j+1)
    non_survivors[col].value_counts().plot.pie(autopct = "%1.0f%%",colors = sns.color_palette("prism",3),startangle = 90,
                                               wedgeprops={"linewidth":1,"edgecolor":"white"},shadow =True)
    circ = plt.Circle((0,0),.7,color="white")
    plt.gca().add_artist(circ)
    plt.ylabel("")
    plt.title(col+"-Dead")


categorical_columns = ['pclass', 'survived', 'sibsp', 'parch']
target = "survived"

for col in categorical_columns:
	plt.figure(figsize=(16,8))
	plt.subplot(121)
	df[df[target]==0][col].value_counts().plot.pie(fontsize=9,autopct = "%1.0f%%",colors = sns.color_palette("Set1"),
	wedgeprops={"linewidth":2,"edgecolor":"white"},shadow =True)
	circ = plt.Circle((0,0),.7,color="white")
	plt.gca().add_artist(circ)
	plt.title("Distribution of "+col+" type for target==0",color="b")

	plt.subplot(122)
	df[df[target]==1][col].value_counts().plot.pie(fontsize=9,autopct = "%1.0f%%", colors = sns.color_palette("Set1"),
	wedgeprops={"linewidth":2,"edgecolor":"white"},shadow =True)
	circ = plt.Circle((0,0),.7,color="white")
	plt.gca().add_artist(circ)
	plt.title("Distribution of "+col+" type for target==1",color="b")
	plt.ylabel("")
	plt.show()


Feb 1, 2021 - Spatial Data

Spatial Data

Choropleth Maps


#!conda install -c plotly plotly
import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)

df_px = px.data.election()
geojson = px.data.election_geojson()

fig = px.choropleth(df_px, geojson=geojson, color="Bergeron",
                    locations="district", featureidkey="properties.district",
                    projection="mercator"
                   )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
#fig.show() # Use this to render the plot in your notebook

# Pass the figure directly, with real booleans: the string "False" is truthy
py.plot(fig, output_type="div", show_link=False, include_plotlyjs=False, link_text="") # For HTML rendering

Jan 2, 2021 - Interactive Graphs

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import plotly.offline as py
import plotly.graph_objects as go
py.init_notebook_mode(connected=True)

df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/school_earnings.csv")

data = [go.Bar(x=df.School,
            y=df.Gap)]
# x holds the school names and y the gap, so the axes are labelled accordingly
layout = go.Layout(dict(title = "Male and Female gap in schools",
                  xaxis = dict(title = 'School Name'),
                  yaxis = dict(title = 'Gap'),
                  ))

py.plot(dict(data=data, layout=layout), include_plotlyjs=False, output_type='div')

# py.iplot(dict(data=data, layout=layout),filename='Basic bar plot')

Mar 10, 2020 - Time Series

%matplotlib inline
sns.set_style('darkgrid')

Content

About the dataset
Line Plot
Plot Multiple time series
Seasonal and trend Components
- Visual analysis of Seasonality
Check stationarity visually using rolling mean
References

About the dataset

This dataset is originally from the Yahoo Finance website. It contains the ‘open’, ‘high’, ‘low’, ‘close’, ‘adj_close’ and ‘volume’ series for IBM.

Line Plot

f, ax = plt.subplots(nrows=6, ncols=1, figsize=(15, 30))

for i, col in enumerate(df.drop('date', axis=1).columns):
    sns.lineplot(x='date', y=col,data=df, ax=ax[i], color='dodgerblue')
    ax[i].set_title('Feature: {}'.format(col), fontsize=14)
    ax[i].set_ylabel(ylabel=col, fontsize=14)


Plot Multiple time series

plt.figure(figsize=(12,6))
sns.lineplot(data=df[['adj_close','open','date']].set_index('date'))


Check stationarity visually using rolling mean

# Window of 52 observations: about one year if the data is weekly (a year has approx. 52 weeks)
rolling_window = 52
f = plt.figure(figsize=(15, 6))
ax = plt.gca()

sns.lineplot(x=df['date'], y=df['adj_close'], color='dodgerblue')
sns.lineplot(x=df['date'], y=df['adj_close'].rolling(rolling_window).mean(),  color='black', label='rolling mean')
sns.lineplot(x=df['date'], y=df['adj_close'].rolling(rolling_window).std(), color='orange', label='rolling std')
ax.set_title('Adjusted Close: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax.set_ylabel(ylabel='Adjusted Close Price', fontsize=14)
ax.set_xlim([pd.to_datetime('2020-01-01', format='%Y-%m-%d'), pd.to_datetime('2020-12-31', format='%Y-%m-%d')])

plt.tight_layout()
plt.show()
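
To make the rolling statistics above concrete, here is a small plain-Python sketch of what `rolling(window).mean()` and `rolling(window).std()` compute: a statistic over each trailing window, with no value until the window is full. The helper name `rolling` and the sample prices are hypothetical. If both curves stay roughly flat over time, the series looks stationary; drifting curves suggest it is not.

```python
from statistics import mean, stdev

def rolling(values, window, func):
    """Apply func to each trailing window; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # pandas yields NaN at these positions
        else:
            out.append(func(values[i + 1 - window : i + 1]))
    return out

# Hypothetical prices, not taken from the IBM data.
prices = [10, 12, 11, 13, 15, 14]
rolling_mean = rolling(prices, 3, mean)
rolling_std = rolling(prices, 3, stdev)  # sample std (ddof=1), like pandas
print(rolling_mean)
print(rolling_std)
```

In this toy series the rolling mean climbs steadily, which is the kind of pattern that indicates a non-constant mean.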


Seasonal and trend Components

from statsmodels.tsa.seasonal import seasonal_decompose

core_columns =  ['adj_close','volume']

for column in core_columns:
    decomp = seasonal_decompose(df[column], period=52, model='additive', extrapolate_trend='freq')
    df[f"{column}_trend"] = decomp.trend
    df[f"{column}_seasonal"] = decomp.seasonal

fig, ax = plt.subplots(ncols=2, nrows=4, sharex=True, figsize=(16,8))

for i, column in enumerate(['adj_close', 'volume']):
    
    # period= replaces the deprecated freq= argument (and matches the call above)
    res = seasonal_decompose(df[column], period=52, model='additive', extrapolate_trend='freq')

    ax[0,i].set_title('Decomposition of {}'.format(column), fontsize=16)
    res.observed.plot(ax=ax[0,i], legend=False, color='dodgerblue')
    ax[0,i].set_ylabel('Observed', fontsize=14)

    res.trend.plot(ax=ax[1,i], legend=False, color='dodgerblue')
    ax[1,i].set_ylabel('Trend', fontsize=14)

    res.seasonal.plot(ax=ax[2,i], legend=False, color='dodgerblue')
    ax[2,i].set_ylabel('Seasonal', fontsize=14)
    
    res.resid.plot(ax=ax[3,i], legend=False, color='dodgerblue')
    ax[3,i].set_ylabel('Residual', fontsize=14)

plt.show()


Visual analysis of Seasonality

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(15, 12))
f.suptitle('Seasonal Components of Features', fontsize=16)

for i, column in enumerate(core_columns):
    sns.lineplot(x=df['date'], y=df[column + '_seasonal'], ax=ax[i], color='dodgerblue', label=column + '_seasonal')
    ax[i].set_ylabel(ylabel=column, fontsize=14)
    ax[i].set_xlim([pd.to_datetime('2020-01-01', format='%Y-%m-%d'), pd.to_datetime('2020-12-31', format='%Y-%m-%d')])
    
plt.tight_layout()
plt.show()


References

https://www.kaggle.com/andreshg/timeseries-analysis-a-complete-guide


EDAV (using Python)

Group 28 - community contribution project for EDAV 5702

Keertan Krishnan - kk3446

Rahul Agarwal - ra3097

Shaurya Malik- sm4969
