vishelper package¶
Subpackages¶
Submodules¶
vishelper.cluster module¶
- vishelper.cluster.create_labels(z, features_in, levels, criteria='distance', feature_names=None)[source]¶
Labels each observation according to what cluster number it would fall into under the specified criteria and level(s).
- Parameters
z – The hierarchical clustering encoded with the matrix returned by Scipy’s linkage function.
features_in – list of features for each sample or dataframe
levels – list of levels at which to label samples; their meaning depends on the criteria used. If criteria = ‘distance’, clusters are formed so that observations in a given cluster have a cophenetic distance no greater than the level(s) provided. If criteria = ‘maxclust’, the dendrogram is cut so that no more than the level number of clusters is formed. See http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html for more options.
criteria – string naming the criterion to use for creating clusters. ‘distance’ and ‘maxclust’ are the two most commonly used; see levels above. Other options can be found at http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html
feature_names – list of labels corresponding to the features, used to create the pandas dataframe for output if a dataframe is not provided.
- Returns: features: Pandas dataframe with feature values for each observation as well as the assigned cluster number for each specified level. Cluster assignment columns are labeled by str(level).
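To illustrate what the ‘distance’ criterion means, the sketch below forms flat clusters from a linkage matrix in plain Python by applying every merge whose distance does not exceed the level. It is not the implementation of create_labels; the name flat_clusters_by_distance and the four-observation linkage matrix are hypothetical, and the cluster numbering may differ from what Scipy’s fcluster assigns.

```python
def flat_clusters_by_distance(z, n_obs, level):
    """Label observations by applying every merge in z with distance <= level."""
    # One node per original observation plus one node per merge step.
    parent = list(range(n_obs + len(z)))

    def find(node):
        # Follow parent links to the current root, shortening the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for step, (left, right, distance, _count) in enumerate(z):
        if distance <= level:
            new_node = n_obs + step
            parent[find(int(left))] = new_node
            parent[find(int(right))] = new_node

    # Number clusters 1, 2, ... in order of first appearance (an assumption).
    cluster_ids = {}
    labels = []
    for obs in range(n_obs):
        root = find(obs)
        labels.append(cluster_ids.setdefault(root, len(cluster_ids) + 1))
    return labels


# Hypothetical linkage matrix for four observations: [left, right, distance, size].
z = [
    [0, 1, 0.5, 2],  # observations 0 and 1 merge at distance 0.5 (node 4)
    [2, 3, 1.0, 2],  # observations 2 and 3 merge at distance 1.0 (node 5)
    [4, 5, 3.0, 4],  # the two pairs merge at distance 3.0 (node 6)
]

# One list of labels per level, keyed by str(level) as in the returned dataframe.
labels_by_level = {
    str(level): flat_clusters_by_distance(z, 4, level) for level in (0.75, 2.0, 5.0)
}
```

Raising the level merges more observations together, so the number of distinct labels can only stay the same or shrink.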
- vishelper.cluster.dendrogram(z, xlabel='Observations', thresh_factor=0.5, remove_ticks=False, **kwargs)[source]¶
Creates a dendrogram from z. Colors clusters below thresh_factor*max(cophenetic distance).
- Parameters
z – The hierarchical clustering encoded with the matrix returned by Scipy’s linkage function.
xlabel – String for xlabel of figure
thresh_factor – Colors clusters according to those formed by cutting the dendrogram at thresh_factor*max(cophenetic distance)
Returns: R (see Scipy dendrogram docs). Displays: Dendrogram.
- vishelper.cluster.heatmap_pca(V, normalize=True, n_feats=None, n_comps=None, cmap=None, feature_names=None, transpose=False, **kwargs)[source]¶
Creates a heatmap of the composition of the principal components given by V. If normalize is left as default (True), the magnitude of the components in V will be normalized to give a percent composition of each feature in V.
- Parameters
V – list of lists. PCA components. N components x M features.
normalize – optional boolean, default True, whether to normalize V to relative weights.
n_feats – optional int - number of features to include in figure.
n_comps – optional int - number of components to include in figure.
feature_names – optional list of strings to include feature names in axis labels. Will default to ‘Feature 1’, ‘Feature 2’, etc. if not specified.
Returns nothing, displays a figure.
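One way to read the normalize option is that each component’s absolute weights are divided by their sum, so each row gives the share contributed by each feature. The sketch below shows that reading in plain Python; the exact normalization used by heatmap_pca is an assumption, and percent_composition and the example V are hypothetical.

```python
def percent_composition(components):
    """Rescale each component's absolute weights so they sum to 1."""
    normalized = []
    for component in components:
        total = sum(abs(weight) for weight in component)
        if total == 0:
            # An all-zero component has no composition to report.
            normalized.append([0.0 for _ in component])
        else:
            normalized.append([abs(weight) / total for weight in component])
    return normalized


# Hypothetical components: 2 components x 3 features.
V = [[0.6, -0.8, 0.0], [0.5, 0.5, -1.0]]
weights = percent_composition(V)
```

The sign of each weight is discarded, so the result describes how much each feature contributes rather than in which direction.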
- vishelper.cluster.separate(features, level, feature_name, minimum_population=10)[source]¶
Separates features into lists based on their cluster label, and sets aside clusters smaller than the minimum population as outliers.
- Parameters
features – Pandas dataframe with rows for each observation, columns for each feature value and cluster label for each level (see create_labels()).
level – Level at which you want to separate observations into groups.
feature_name – Desired feature for grouping.
minimum_population – Minimum population for which a labeled cluster can be considered a full cluster. Any cluster which has a lower population will be considered a group of outliers.
- Returns: sep_features = list of lists of feature values for each labeled group greater than minimum_population in size.
- outliers = feature values for any observations in a cluster which has a size smaller than minimum_population.
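The grouping described above can be sketched with plain lists in place of a dataframe. This is an illustration only: separate_by_cluster and the sample data are hypothetical, and treating a cluster with exactly minimum_population members as a full cluster is an assumption, since the description does not say which side that case falls on.

```python
from collections import defaultdict


def separate_by_cluster(labels, values, minimum_population=10):
    """Split values by cluster label; pool clusters that are too small as outliers."""
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)

    sep_values, outliers = [], []
    for label in sorted(groups):
        if len(groups[label]) >= minimum_population:  # assumption: ties count as full clusters
            sep_values.append(groups[label])
        else:
            outliers.extend(groups[label])
    return sep_values, outliers


# Hypothetical cluster labels and one feature's values for six observations.
labels = [1, 1, 1, 2, 2, 3]
values = [0.1, 0.2, 0.3, 5.0, 5.5, 42.0]
sep_values, outliers = separate_by_cluster(labels, values, minimum_population=2)
```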
- vishelper.cluster.visualize_clusters(features, level, feature_names, bins=20, xlim=None, ylim=None, log=False)[source]¶
Plots a histogram of the number of samples assigned to each cluster at a given cophenetic distance and the distribution of the features for each cluster. This assumes labels exist in the features dataframe in column str(level).
vishelper.colorize module¶
- vishelper.colorize.color_categorical(df, column_to_color, new_color_column='color', colors=None)[source]¶
Adds a column to a dataframe with colors assigned according to the category in the column_to_color column.
- vishelper.colorize.color_continuous(df, column_to_color, new_color_column='color', clip=True, log10=False, cmap=None, return_all=False, **kwargs)[source]¶
Adds a column to a dataframe with colors assigned according to the continuous value in the column_to_color column.
- vishelper.colorize.column_to_colors(df, column, colors=None)[source]¶
Takes a column of categorical values and assigns a color to each category.
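As a rough picture of assigning a color to each category, the sketch below maps each distinct value to the next color in a list. It is not the vishelper implementation: assign_category_colors, the hex colors, and the choice to cycle through the list when there are more categories than colors are all assumptions made for illustration.

```python
from itertools import cycle


def assign_category_colors(values, colors):
    """Map each distinct category to a color, in order of first appearance."""
    palette = cycle(colors)  # assumption: reuse colors once the list runs out
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = next(palette)
    return mapping


# Hypothetical categorical column and color list.
column = ["a", "b", "a", "c"]
color_map = assign_category_colors(column, ["#1b9e77", "#d95f02"])
row_colors = [color_map[value] for value in column]
```

Keeping the mapping as a dictionary means the same category gets the same color wherever it is plotted.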
vishelper.config module¶
vishelper.dfplot module¶
- class vishelper.dfplot.VisDF(df, column_labels=None, labels=None, cluster_label=None, index=None, numeric_columns=None, nonnumeric_columns=None, color_dict=None, columns_to_color=None, colors=None, univariate_ylabels=None, scale=False, pca=False)[source]¶
Bases: object
Easily create typical visualizations of data in a pandas dataframe.
This class performs a number of plotting tasks for pandas data frames including easy sub-plotting and consistent axes labeling.
- Parameters
df (pandas dataframe) – Data to be plotted
column_labels (dict) – Dictionary mapping columns of the dataframe to the labels to use instead for axes labels and legends when plotting. If None, column names will have ‘_’ replaced with spaces and the first letter capitalized (e.g. “path_length” –> “Path length”)
cluster_label (str, optional) – If it exists, the name of the column that gives cluster or group assignments (and is not a feature).
index (str or list of str, optional) – Name of identifying column such as customer id or transaction id or other column that should not be analyzed.
numeric_columns (list of str, optional) – List of column names corresponding to numeric fields. If not provided, this list will be assessed by data type of each column and will exclude the index and cluster label, if given.
nonnumeric_columns (list of str, optional) – List of column names corresponding to non-numeric fields. If not provided, this list will be assessed by data type of each column and will exclude the index and cluster label, if given.
color_dict (dict) – Optional. Keys correspond to categorical columns where consistent coloring by category is desired. Each key has a dictionary as its value with the category names and corresponding colors to use. Dictionary structure is: {‘column_name’:{‘category1’: ‘#colorx’, ‘category2’: ‘#colory’}}
columns_to_color (list of str, optional) – List of names of categorical columns to apply consistent coloring to. Colors to assign to each category will be provided by the colors attribute.
colors (list) – List of colors to cycle through in plotting (if None provided, will use defaults defined in config file).
univariate_ylabels (dict) – Dictionary of univariate plot types and corresponding y-labels to use. Default is dict(hist=’Count’, barh=’Count’).
scale (bool) – Default False. If True, scales the numeric columns of the dataframe and stores them in the scaled attribute.
pca (bool) – Default False. If True, calculates the principal components of the numeric data and stores them in the pca attribute.
- add_color_dict(column_name, colors=None)[source]¶
Assign colors to categories within a defined column of the data.
- Parameters
Returns: Nothing
- category_heatmap(category, variables=None, transpose=False, measure=<function mean>, metric='actual', category_dict=None, xlabel=None, ylabel=None, cat_order=None, log10=False, **kwargs)[source]¶
- fscore_by_feature(category_column)[source]¶
Prioritize categorical x continuous interactions to investigate.
The F-test in one-way analysis of variance is used to assess whether the expected values of the variable within the categories in the category column differ from each other. A higher f-value or lower p-value indicates a bigger difference between the categories in a given variable.
This is not meant to be used to make any statistically validated claims about the variables, as no assumptions have been checked. Moreover, no adjustment has been made for multiple testing. This should *only* be used as a directional signal of which variables may interact most with the categorical column provided and which should be prioritized for visual investigation.
- Parameters
category_column – Name of the column to group observations by and compare distributions of variables across.
- Returns: fps: pandas.core.frame.DataFrame of variables and their corresponding f-score and p-value.
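For readers who want to see what the f-score measures, the sketch below computes the one-way ANOVA F statistic by hand (between-group mean square divided by within-group mean square) and ranks two variables by it. It leaves out the p-value, which Scipy’s scipy.stats.f_oneway also reports; whether fscore_by_feature uses that function is an assumption, and one_way_f_statistic and the sample data are hypothetical.

```python
def one_way_f_statistic(groups):
    """F = (between-group sum of squares / (k - 1)) / (within-group sum of squares / (N - k))."""
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((value - group_mean) ** 2 for value in group)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical data: each variable's values, already split by category.
values_by_category = {
    "height": [[1, 2, 3], [2, 3, 4], [5, 6, 7]],
    "weight": [[1, 2, 3], [1, 2, 3], [2, 3, 4]],
}
f_scores = {name: one_way_f_statistic(groups) for name, groups in values_by_category.items()}

# Variables with the largest F differ most between categories; inspect those first.
ranking = sorted(f_scores, key=f_scores.get, reverse=True)
```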
- labeled_scatter(category=None, x=None, y=None, pca=True, **kwargs)[source]¶
Method for visualizing clusters in 2D.
- pca¶
Pandas dataframe of the principal components of the numeric columns of the data.
- percent_above_below(category, quantile_threshold=0.5, how='above', exclude=None, category_dict=None, transpose=False, cluster=True, variables=None, xlabel=' ', ylabel=' ', cat_order=None, **kwargs)[source]¶
- Parameters
category –
quantile_threshold –
how –
exclude –
category_dict –
transpose –
cluster –
variables –
xlabel –
ylabel –
cat_order –
**kwargs –
Returns:
- scaled¶
Pandas dataframe of numeric data scaled.
- subplots(columns_to_plot, kind, color_by=None, sort_by=None, ascending=False, layout=None, titles=None, main_title=None, xlim=None, ylim=None, legend_labels=None, legend_order=None, figsize=(16, 10), counts=False, top_counts=None, min_counts=None, **kwargs)[source]¶
Create figure with many subplots at once from dataframe.
This method will create a figure with subplots based on the column names inputted.
- Parameters
columns_to_plot (list of str or lists) – The column(s) to plot in each figure.
- Univariate only: If only plotting univariate plots, columns_to_plot will look like [‘column1’, ‘column2’, …, ‘columnN’] where ‘column1’ will be plotted in figure 1, ‘column2’ in figure 2, etc.
- Bivariate only: If only plotting bivariate plots, columns_to_plot will look like [[‘columnx1’, ‘columny1’], [‘columnx2’, ‘columny2’], …] where ‘columnx1’ will be plotted vs ‘columny1’ in figure 1.
- Mix of univariate and bivariate: If plotting a mix of plot types, columns_to_plot will look something like: [‘column1’, [‘columnx2’, ‘columny2’], ‘column3’, …]. Note that this requires a mix of plot types, and **kwargs that do not work in every plot function used cannot currently be provided.
kind (str or list of str) – What type of plot to draw. If a string, the plot type is used for each subplot. If a list, it should be the same length as columns_to_plot and describe what type of plot to use in each subplot.
layout (tuple, optional) – Number of rows x number of columns. If not given, the layout will default to N x 2 where N is calculated based on the length of columns_to_plot.
main_title (str, optional) – Title for the entire figure.
titles (list of str, optional) – List of titles for each subplot corresponding to the order of columns_to_plot.
counts (bool) – If True, pd.DataFrame.value_counts() is plotted rather than the data frame data. This is typically used for categorical fields.
top_counts (int, optional) – If counts is True and this argument is provided, only the first top_counts rows of the pd.DataFrame.value_counts() data are used when the dataframe is sorted from highest to lowest counts.
min_counts (int, optional) – If counts is True and this argument is provided, only the first min_counts rows of the pd.DataFrame.value_counts() data are used when the dataframe is sorted from lowest to highest counts.
**kwargs – Any other keyword arguments will be fed into the plotting function and should be arguments to the core matplotlib plotting function (e.g. bins for histograms). Currently cannot feed arguments that don’t apply to all plot kinds being used (as they are automatically filled in).
TO DO: something to allow for keyword arguments to only be fed to the functions that they apply to
- Returns: fig: matplotlib.figure.Figure. axes: numpy.ndarray of numpy.ndarray of matplotlib.axes._subplots.AxesSubplot.
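The three accepted shapes of columns_to_plot and the default N x 2 layout can be sketched as follows. This only illustrates how a mixed list might be interpreted; describe_subplots is a hypothetical helper, and computing N as the number of entries divided by two and rounded up is an assumption about how the default layout is derived.

```python
import math


def describe_subplots(columns_to_plot, n_cols=2):
    """Return a (rows, columns) layout and a per-subplot plan for columns_to_plot."""
    # Assumption: enough rows to hold every subplot at n_cols per row.
    n_rows = math.ceil(len(columns_to_plot) / n_cols)

    plan = []
    for entry in columns_to_plot:
        if isinstance(entry, str):
            # A bare column name is a univariate subplot.
            plan.append(("univariate", entry, None))
        else:
            # A two-item list is an [x, y] pair for a bivariate subplot.
            x_column, y_column = entry
            plan.append(("bivariate", x_column, y_column))
    return (n_rows, n_cols), plan


# Hypothetical mix of univariate and bivariate entries.
layout, plan = describe_subplots(["column1", ["columnx2", "columny2"], "column3"])
```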
- to_labels(columns)[source]¶
Convert column names to figure labels.
Converts a list of column names into labels based on either the provided dictionary column_labels or, if not provided, the labelfy function, which replaces underscores with spaces and capitalizes the first letter of the string.
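The conversion described above is simple enough to sketch directly. The version below is written from the description rather than from the vishelper source: the bodies of labelfy and to_labels are assumptions, and in particular falling back to labelfy for any column missing from column_labels is a guess at the behavior.

```python
def labelfy(column_name):
    """Replace underscores with spaces and capitalize the first letter."""
    label = column_name.replace("_", " ")
    return label[:1].upper() + label[1:]


def to_labels(columns, column_labels=None):
    """Use column_labels where a mapping exists, otherwise fall back to labelfy."""
    column_labels = column_labels or {}
    return [column_labels.get(column, labelfy(column)) for column in columns]


default_labels = to_labels(["path_length", "customer_id"])
custom_labels = to_labels(["path_length", "customer_id"], {"customer_id": "Customer"})
```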
vishelper.helpers module¶
vishelper.interactive module¶
- vishelper.interactive.interactive_heatmap(df, save_path, ycolumn='dayofweek', xcolumn='weekof', value_column='value', x_range=None, y_range=None, colors=None, vmin=None, vmax=None, bokehtools='hover,save,pan,box_zoom,reset,wheel_zoom', title='', plot_width=900, plot_height=500, min_border_right=0, colorbar_format='%d lbs', x_axis_location='above', y_axis_location='left', toolbar_location='below', colorbar_orientation='vertical', colorbar_place='right', tooltips=None, label_font_size='10pt', xlabel_orientation=None, colorbar_label_standoff=20, colorbar_major_label_text_align='center', xlabel='', ylabel='')[source]¶
Creates an interactive heatmap with tooltips
- Parameters
df –
save_path (str) – Where to save the output
ycolumn (str) – Column in the dataframe that determines the row of the heatmap (default: ‘dayofweek’)
xcolumn (str) – Column in the dataframe that determines the column of the heatmap (default: ‘weekof’)
value_column (str) – Column in the dataframe whose values determine the color of each row and column intersection.
x_range (list or similar) – The possible column values (e.g. Week of Jan 1, Week of Jan 8, …). Defaults to the unique set of values in the xcolumn
y_range (list or similar) – The possible row values (e.g. Monday, Tuesday, …). Defaults to the unique set of values in the ycolumn
colors – Color scale to use. Defaults to palettable.colorbrewer.sequential.BuGn_9.hex_colors
vmin –
vmax –
bokehtools –
title –
plot_width –
plot_height –
min_border_right (int) – Minimum space left between the right side of the image and the border of the figure. Default 0. It is recommended to change this to ~80 when setting colorbar_orientation to horizontal, to allow room for x-axis labels, which are oriented at pi/3
colorbar_format –
x_axis_location – which side to put the x-axis (column) labels. Default: ‘above’. Options: ‘above’, ‘below’
y_axis_location – which side to put the y-axis (row) labels. Default: ‘left’. Options: ‘left’, ‘right’
colorbar_orientation (str) – How to orient the colorbar, ‘vertical’ or ‘horizontal’. Default: ‘vertical’
colorbar_place (str, optional) – where to add the colorbar (default: ‘right’) Valid places are: ‘left’, ‘right’, ‘above’, ‘below’, ‘center’.
toolbar_location –
tooltips –
label_font_size –
xlabel_orientation (float) – Orientation of labels on x-axis. If left as None, default is pi/3
colorbar_label_standoff (int) – How much space to leave between colorbar and colorbar labels. Default 20. It is recommended to set to ~5 for vertical color bars.
colorbar_major_label_text_align (str) – How to align tick labels to ticks. Default ‘center’.
xlabel (str) – Label for x-axis. Default=””
ylabel (str) – Label for y-axis. Default=””
Returns:
vishelper.plot module¶
- vishelper.plot.plot(x=None, y=None, df=None, kind=None, plot_function=None, ax=None, xlabel=None, ylabel=None, title=None, legend=None, legend_kwargs=None, ticks=None, labels=None, color=None, color_data=None, color_by=None, figsize=None, xlim=None, ylim=None, tight_layout=False, **kwargs)[source]¶
Makes a plot of one or more univariate or multivariate datasets.
Options for kind of plot:
Univariate, continuous data (y=None):
hist (histogram)
Bivariate, continuous x continuous data:
scatter
line
Bivariate, categorical x continuous data:
boxplot
bar
barh (horizontal bar plot)
Multivariate, continuous data (x=None, y=None, df != None):
heatmap
- Parameters
x (optional, default None) –
Data or column name(s) to be plotted for univariate plots, along the x-axis for continuous x continuous data, or categorical data for categorical x continuous data.
For one dataset: list or np.array of x-data or str denoting column of df to use for x-data.
More than one dataset: list of list of x-data sets or list of str with columns to be used for x-data.
If None, df must be provided and kind must be ‘heatmap’.
y (optional) –
Univariate: None (default).
For one bivariate dataset: list or np.array of y-data or str denoting column of df to use for y-data.
More than one bivariate dataset: list of list of y-data sets or list of str with columns to be used for y-data.
kind (str, optional) – Type of plot. Defaults to None, which implies plot_function must be given. Options for kind:
Univariate, continuous data (y=None):
hist (histogram)
Bivariate, continuous x continuous data:
scatter
line
Bivariate, categorical x continuous data:
boxplot
bar
barh (horizontal bar plot)
Multivariate, continuous data (x=None, y=None, df != None):
heatmap
df – Dataframe containing the data to be plotted. Default None, in which case the data must be provided in x (and y if bivariate).
plot_function – Default None. If “kind” is not given, can provide a plot function with the form plot_function(x, y, ax, **kwargs) or plot_function(x, ax, **kwargs).
ax (matplotlib.axes._subplots.AxesSubplot, optional) – Matplotlib axes handle. Default is None and a new ax is generated along with the fig.
xlabel (str, optional) – Label for x axis. Default None.
ylabel (str, optional) – Label for y axis. Default None.
title (str, optional) – Plot title. Default None.
legend (bool, optional) – A legend will be displayed if legend=True, or if legend=None and more than one thing is being plotted. Default None.
legend_kwargs (dict, optional) – Dictionary of keyword arguments for the legend (see legend for when a legend will be displayed). Default None.
labels (list of str, optional) – List of labels for legend. Default None. If legend is True, and no labels are provided, labels will be “Set N” where N is the ordering of the inputs.
ticks (list of str, optional) – List of tick labels. Default None.
color (str or list of str, optional) – Color to plot. List of colors if there is more than one thing being plotted. Default is None, in which case, if color_data is also None, default colors from vishelper.config.formatting[“color.all”] will be used.
color_data (list of str, optional) – If provided, the list should be the same length as the data, providing an individual color for each data point. Default None, in which case all points will be colored according to the color argument.
color_by (str) – Which column to color by if providing a dataframe in df.
figsize (tuple, optional) – Figure size. Default is None and plot will be sized based on vishelper.config.formatting[‘figure.figsize’].
xlim (tuple, optional) – Tuple of minimum and maximum x values to plot (xmin, xmax). Default is None and matplotlib chooses these values.
ylim (tuple, optional) – Tuple of minimum and maximum y values to plot (ymin, ymax). Default is None and matplotlib chooses these values.
tight_layout (bool, optional) – Invokes plt.tight_layout() to ensure enough space is given to plot elements
**kwargs – These kwargs are any that are relevant to the plot kind provided.
- Returns: fig (matplotlib.figure.Figure): Matplotlib figure object. ax (matplotlib.axes._subplots.AxesSubplot): Axes object with the plot(s).
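The interaction between the legend and labels parameters can be summarized in a few lines. The sketch below restates the documented rule in plain Python; resolve_legend is a hypothetical helper, not part of vishelper, and numbering the default “Set N” labels from 1 is an assumption.

```python
def resolve_legend(legend, n_datasets, labels=None):
    """Decide whether a legend is shown and which labels it would use."""
    # Shown when explicitly requested, or when left as None with several datasets.
    show_legend = legend is True or (legend is None and n_datasets > 1)
    if show_legend and labels is None:
        # Assumption: default labels count the inputs starting at 1.
        labels = ["Set %d" % (index + 1) for index in range(n_datasets)]
    return show_legend, labels


two_sets = resolve_legend(None, 2)
one_set = resolve_legend(None, 1)
suppressed = resolve_legend(False, 3, ["a", "b", "c"])
```

Passing legend=False suppresses the legend even when several datasets are plotted, which is useful when a shared legend is drawn elsewhere.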
vishelper.reformat module¶
- vishelper.reformat.add_labels(ax, xlabel=None, ylabel=None, title=None, main_title=None)[source]¶
Adds xlabel, ylabel, title, main_title, if provided to ax with size given by the formatting dict.
- vishelper.reformat.adjust_lims(ax, xlim=None, ylim=None)[source]¶
Adjusts the x-axis and y-axis view limits of ax if xlim and/or ylim are provided.
- Parameters
ax (matplotlib.axes._subplots.AxesSubplot) – Matplotlib axes handle
xlim (tuple, optional) – Tuple of (x_min, x_max) giving the range of x-values to view in the plot. If xlim=None (default), the x-axis view limits will not be changed.
ylim (tuple, optional) – Tuple of (y_min, y_max) giving the range of y-values to view in the plot. If ylim=None (default), the y-axis view limits will not be changed.
- Returns
Matplotlib axes handle with adjusted xlim and ylim
- Return type
ax (matplotlib.axes._subplots.AxesSubplot)
- vishelper.reformat.fake_legend(ax, legend_labels, colors, marker=None, size=None, fontsize=None, linestyle='', loc=None, bbox_to_anchor=None, where='best', **kwargs)[source]¶
Adds a fake legend to the plot with the provided legend labels and corresponding colors and attributes.
- Parameters
ax (matplotlib.axes._subplots.AxesSubplot) – Matplotlib axes handle
legend_labels (list of str) – Labels for the legend items.
colors (list) – List of colors of the items in the legend.
marker (str or list of str, optional) – Marker for the items in the legend. Defaults to formatting[‘legend.marker’]
size (str or list of str, optional) – Marker size for the items in the legend. Defaults to formatting[‘markersize’]
where (str, optional) – Where to put the legend. Options are right, below, and best
linestyle (str or list of str, optional) – Line style for items in legend. Defaults to “” (no line).
loc (str, optional) – Location for where to place the legend. Defaults to “best”.
bbox_to_anchor (tuple) – Where to anchor legend, used to place legend outside plot. Default None.
**kwargs – Keyword arguments passed to ax.legend()
- Returns
Matplotlib axes handle with fake legend
- Return type
ax (matplotlib.axes._subplots.AxesSubplot)