vishelper package

Submodules

vishelper.cluster module

vishelper.cluster.create_labels(z, features_in, levels, criteria='distance', feature_names=None)[source]

Labels each observation according to what cluster number it would fall into under the specified criteria and level(s).

Parameters
  • z – The hierarchical clustering encoded with the matrix returned by Scipy’s linkage function.

  • features_in – list of features for each sample or dataframe

  • levels – list of levels at which to label samples. The meaning of a level depends on the criteria used. If criteria = ‘distance’, clusters are formed so that observations in a given cluster have a cophenetic distance no greater than the level provided. If criteria = ‘maxclust’ (Scipy’s name for the maximum-clusters criterion), the dendrogram is cut so that no more than the level number of clusters is formed. See http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html for more options.

  • criteria – string naming the criterion used to create clusters. “distance” and “maxclust” are the two most commonly used; see the levels parameter above. Other options can be found at http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html

  • feature_names – list of labels corresponding to the features, used to create the pandas dataframe for output if a dataframe is not provided.

Returns: features: Pandas dataframe with feature values for each observation as well as the assigned cluster number for each specified level. Cluster assignment columns are labeled by str(level).
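The following is a rough sketch of what labeling at a ‘distance’ level means, written in plain Python so that it runs without Scipy. The helper label_by_distance and the small linkage matrix are hypothetical and are not part of vishelper; the cluster numbers it produces may be ordered differently from those Scipy’s fcluster assigns. It assumes merge distances never decrease from one row of z to the next.

```python
def label_by_distance(z, n_obs, level):
    """Label observations by applying every merge in z with distance <= level.

    z follows the Scipy linkage layout: each row is
    [node_a, node_b, merge_distance, n_observations_in_new_node].
    """
    parent = list(range(2 * n_obs - 1))  # leaves first, then one node per merge

    def find(i):
        # Follow parent links up to the root of the current cluster.
        while parent[i] != i:
            i = parent[i]
        return i

    for step, (a, b, dist, _count) in enumerate(z):
        if dist <= level:
            new_node = n_obs + step
            parent[find(int(a))] = new_node
            parent[find(int(b))] = new_node

    # Number clusters 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(n_obs)]


# Hypothetical linkage for four 1-D observations at 0, 1, 5 and 6.
z = [[0, 1, 1.0, 2], [2, 3, 1.0, 2], [4, 5, 4.0, 4]]
features = [{"x": 0.0}, {"x": 1.0}, {"x": 5.0}, {"x": 6.0}]

for level in [2, 5]:
    for row, label in zip(features, label_by_distance(z, 4, level)):
        row[str(level)] = label  # one label column per level, named str(level)
```

Each row ends up with one extra entry per level, mirroring the str(level) column naming described above.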

vishelper.cluster.dendrogram(z, xlabel='Observations', thresh_factor=0.5, remove_ticks=False, **kwargs)[source]

Creates a dendrogram from z. Colors clusters below thresh_factor*max(cophenetic distance).

Parameters
  • z – The hierarchical clustering encoded with the matrix returned by Scipy’s linkage function.

  • xlabel – String for xlabel of figure

  • thresh_factor – Colors clusters according to those formed by cutting the dendrogram at thresh_factor*max(cophenetic distance)

Returns: R (see Scipy dendrogram docs). Displays: Dendrogram.
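As a small illustration of the coloring cut-off, the sketch below computes thresh_factor*max(cophenetic distance) from a linkage matrix. The function name color_threshold and the example matrix are hypothetical; it relies on the fact that the largest cophenetic distance equals the largest merge distance stored in the third column of z. Whether vishelper passes this value to Scipy’s dendrogram as its color_threshold argument is an assumption.

```python
def color_threshold(z, thresh_factor=0.5):
    """Height below which clusters are given their own color."""
    # The largest cophenetic distance is the largest merge distance in z.
    max_distance = max(row[2] for row in z)
    return thresh_factor * max_distance


# Hypothetical linkage matrix: rows are [node_a, node_b, distance, count].
z = [[0, 1, 1.0, 2], [2, 3, 1.0, 2], [4, 5, 4.0, 4]]
threshold = color_threshold(z)  # 0.5 * 4.0
```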

vishelper.cluster.heatmap_pca(V, normalize=True, n_feats=None, n_comps=None, cmap=None, feature_names=None, transpose=False, **kwargs)[source]

Creates a heatmap of the composition of the principal components given by V. If normalize is left as default (True), the magnitude of the components in V will be normalized to give a percent composition of each feature in V.

Parameters
  • V – list of lists. PCA components. N components x M features.

  • normalize – optional boolean, default True, whether to normalize V to relative weights.

  • n_feats – optional int - number of features to include in figure.

  • n_comps – optional int - number of components to include in figure.

  • feature_names – optional list of strings to include feature names in axis labels. Will default to ‘Feature 1’,’Feature 2’, etc if not specified.

Returns nothing, displays a figure.
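One plausible reading of the normalization is sketched below: each component’s absolute weights are divided by their sum so that a row of V reads as a percent composition. The function percent_composition is hypothetical, and the actual vishelper normalization may differ (for example, it could use squared weights). The sketch assumes no component is all zeros.

```python
def percent_composition(V):
    """Scale each component so that its absolute weights sum to 1."""
    normalized = []
    for component in V:
        total = sum(abs(w) for w in component)
        normalized.append([abs(w) / total for w in component])
    return normalized


# Two hypothetical components over two features (N components x M features).
V = [[0.6, -0.8], [0.5, 0.5]]
shares = percent_composition(V)
```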

vishelper.cluster.reset_mpl_format()[source]

Reapplies the matplotlib format update.

vishelper.cluster.separate(features, level, feature_name, minimum_population=10)[source]

Separates features into lists based on their cluster label and sets aside clusters smaller than the minimum population as outliers.

Parameters
  • features – Pandas dataframe with rows for each observation, columns for each feature value and cluster label for each level (see create_labels()).

  • level – Level at which you want to separate observations into groups.

  • feature_name – Desired feature for grouping.

  • minimum_population – Minimum population for which a labeled cluster can be considered a full cluster. Any cluster which has a lower population will be considered a group of outliers.

Returns: sep_features = list of lists of feature values for each labeled group greater than minimum_population in size.

outliers = feature values for any observations in a cluster which has a size smaller than minimum_population.
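The grouping described above can be sketched in plain Python, with a list of dicts standing in for the dataframe. The function separate_by_cluster is a hypothetical stand-in, not the vishelper implementation, and it assumes a cluster whose size equals minimum_population counts as a full cluster.

```python
from collections import defaultdict


def separate_by_cluster(rows, level, feature_name, minimum_population=10):
    """Split one feature's values into per-cluster lists plus an outlier list."""
    groups = defaultdict(list)
    for row in rows:
        # The cluster label for a level is stored under the key str(level).
        groups[row[str(level)]].append(row[feature_name])

    sep_features, outliers = [], []
    for label in sorted(groups):
        values = groups[label]
        if len(values) < minimum_population:
            outliers.extend(values)  # cluster too small to stand on its own
        else:
            sep_features.append(values)
    return sep_features, outliers


rows = [
    {"height": 1.0, "2": 1},
    {"height": 1.2, "2": 1},
    {"height": 1.1, "2": 1},
    {"height": 9.0, "2": 2},
]
sep_features, outliers = separate_by_cluster(rows, 2, "height", minimum_population=2)
```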

vishelper.cluster.visualize_clusters(features, level, feature_names, bins=20, xlim=None, ylim=None, log=False)[source]

Plots a histogram of the number of samples assigned to each cluster at a given cophenetic distance and the distribution of the features for each cluster. This assumes labels exist in the features dataframe in column str(level).

vishelper.colorize module

vishelper.colorize.color_categorical(df, column_to_color, new_color_column='color', colors=None)[source]

Adds a column to a dataframe with colors assigned according to the category in the column_to_color

vishelper.colorize.color_continuous(df, column_to_color, new_color_column='color', clip=True, log10=False, cmap=None, return_all=False, **kwargs)[source]

Adds a column to a dataframe with colors assigned according to the continuous value in the column_to_color

vishelper.colorize.column_to_colors(df, column, colors=None)[source]

Takes a column of categorical values and assigns a color to each category.
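A minimal sketch of category-to-color assignment follows. The function categories_to_colors and the palette are hypothetical; it assumes colors are handed out in order of first appearance and reused cyclically when there are more categories than colors, which may not match how vishelper assigns them.

```python
from itertools import cycle


def categories_to_colors(values, colors):
    """Return one color per value plus the category-to-color mapping."""
    color_for = {}
    palette = cycle(colors)  # start over when the palette runs out
    for value in values:
        if value not in color_for:
            color_for[value] = next(palette)
    return [color_for[v] for v in values], color_for


column = ["cat", "dog", "cat", "bird"]
row_colors, mapping = categories_to_colors(column, ["#1b9e77", "#d95f02"])
```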

vishelper.colorize.create_colorbar(ax, cmap, norm, where='right', size='5%', pad=0.25, label=None)[source]

Adds a color bar as defined by the provided cmap and norm

vishelper.colorize.get_plot_color(color_data, color)[source]

vishelper.config module

vishelper.dfplot module

class vishelper.dfplot.VisDF(df, column_labels=None, labels=None, cluster_label=None, index=None, numeric_columns=None, nonnumeric_columns=None, color_dict=None, columns_to_color=None, colors=None, univariate_ylabels=None, scale=False, pca=False)[source]

Bases: object

Easily create typical visualizations of data in a pandas dataframe.

This class performs a number of plotting tasks for pandas data frames including easy sub-plotting and consistent axes labeling.

Parameters
  • df (pandas dataframe) – Data to be plotted

  • column_labels (dict) – Dictionary containing mappings of columns of dataframe and corresponding labels to be used instead when plotting for axes labels and legend. If None, column names will have ‘_’ removed and the first letter capitalized (e.g. “path_length” –> “Path length”)

  • cluster_label (str, optional) – If it exists, the name of the column that gives cluster or group assignments (and is not a feature).

  • index (str or list of str, optional) – Name of identifying column such as customer id or transaction id or other column that should not be analyzed.

  • numeric_columns (list of str, optional) – List of column names corresponding to numeric fields. If not provided, this list will be assessed by data type of each column and will exclude the index and cluster label, if given.

  • nonnumeric_columns (list of str, optional) – List of column names corresponding to non-numeric fields. If not provided, this list will be assessed by data type of each column and will exclude the index and cluster label, if given.

  • color_dict (dict) – Optional. Keys correspond to categorical columns where consistent coloring by category is desired. Each key has a dictionary as its value with the category names and corresponding colors to use. Dictionary structure is: {‘column_name’:{‘category1’: ‘#colorx’, ‘category2’: ‘#colory’}}

  • columns_to_color (list of str, optional) – List of names of categorical columns to apply consistent coloring to. Colors to assign to each category will be provided by the colors attribute.

  • colors (list) – List of colors to cycle through in plotting (if None provided, will use defaults defined in config file).

  • univariate_ylabels (dict) – Dictionary of univariate plot types and corresponding y-labels to use. Default is dict(hist=’Count’, barh=’Count’).

  • scale (bool) – Default False. If True, scales the numeric columns of the dataframe and stores them in the scaled attribute.

  • pca (bool) – Default False. If True, calculates the principal components of the numeric data and stores them in the pca attribute.

add_color_dict(column_name, colors=None)[source]

Assign colors to categories within a defined column of the data.

Parameters
  • column_name (str) – Name of categorical column in the data.

  • colors (list of str) – Optional. Colors to be assigned to the categories. If not provided, will use self.colors.

Returns: Nothing

category_heatmap(category, variables=None, transpose=False, measure=<function mean>, metric='actual', category_dict=None, xlabel=None, ylabel=None, cat_order=None, log10=False, **kwargs)[source]
compare_categories(category, variables=None, measure=<function mean>)[source]
dict_to_colors(column_name, df=None)[source]
fscore_by_feature(category_column)[source]

Prioritize categorical x continuous interactions to investigate.

The F-test in one-way analysis of variance is used to assess whether the expected values of the variable within the categories in the category column differ from each other. A higher f-value or lower p-value indicates a bigger difference between the categories in a given variable.

This is not meant to be used to make any statistically validated claims about the variables, as no assumptions have been checked. Moreover, no adjustment has been made for performing multiple tests. This should only be used as a directional signal of which variables may interact most with the categorical column provided and which should be prioritized for visual investigation.

Parameters

category_column – Name of the column to group observations by; the distributions of the other variables are compared across its categories.

Returns

fps (pandas.core.frame.DataFrame) – Variables and corresponding f-score and p-value.
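For reference, the one-way ANOVA F-value for a single variable can be computed by hand as the ratio of between-group to within-group mean squares. The sketch below is a plain-Python illustration with made-up data, not the vishelper code path; the function name f_statistic is hypothetical. Obtaining the p-value additionally requires the F distribution (Scipy’s scipy.stats.f_oneway returns both).

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F-value for a list of groups of numeric values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# One variable split by three hypothetical categories.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = f_statistic(groups)  # about 13.0
```

A larger F-value here means the category means are further apart relative to the spread within each category.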

labeled_scatter(category=None, x=None, y=None, pca=True, **kwargs)[source]

Method for visualizing clusters in 2D.

pca

Pandas dataframe of the principal components of the numeric columns of the data.

percent_above_below(category, quantile_threshold=0.5, how='above', exclude=None, category_dict=None, transpose=False, cluster=True, variables=None, xlabel=' ', ylabel=' ', cat_order=None, **kwargs)[source]
Parameters
  • category

  • quantile_threshold

  • how

  • exclude

  • category_dict

  • transpose

  • cluster

  • variables

  • xlabel

  • ylabel

  • cat_order

  • **kwargs

Returns:

scaled

Pandas dataframe of numeric data scaled.

subplots(columns_to_plot, kind, color_by=None, sort_by=None, ascending=False, layout=None, titles=None, main_title=None, xlim=None, ylim=None, legend_labels=None, legend_order=None, figsize=(16, 10), counts=False, top_counts=None, min_counts=None, **kwargs)[source]

Create figure with many subplots at once from dataframe.

This method will create a figure with subplots based on the column names inputted.

Parameters
  • columns_to_plot (list of str or lists) –

    The column(s) to plot in each figure.

    Univariate only: If only plotting univariate plots, columns_to_plot will look like [‘column1’, ‘column2’, …, ‘columnN’] where ‘column1’ will be plotted in figure 1, ‘column2’ in figure 2, etc.

    Bivariate only: If only plotting bivariate plots, columns_to_plot will look like [[‘columnx1’, ‘columny1’], [‘columnx2’, ‘columny2’], …] where ‘columnx1’ will be plotted vs ‘columny1’ in figure 1.

    Mix of univariate and bivariate: If plotting a mix of plot types, columns_to_plot will look something like: [‘column1’, [‘columnx2’, ‘columny2’], ‘column3’, …]. Note that this will require a mix of plot types, and **kwargs that do not work in all of the plot functions used cannot currently be provided.

  • kind (str or list of str) – What type of plot to plot. If a string, the plot type is used for each subplot. If a list, it should be the same length as columns_to_plot and describe what type of plot to use in each subplot.

  • layout (tuple, optional) – # of rows x # columns. If not given, the layout will default to N x 2 where N is calculated based on length of columns_to_plot

  • main_title (str, optional) – Optional, title for the entire figure.

  • titles (list of str, optional) – List of titles for each subplot, corresponding to the order of columns_to_plot.

  • counts (bool) – If True, pd.DataFrame.value_counts() is plotted rather than the data frame data. This is typically used for categorical fields.

  • top_counts (int, optional) – If counts is True and this argument is provided, only the first top_counts rows of the pd.DataFrame.value_counts() data are plotted, with the data sorted from highest to lowest counts.

  • min_counts (int, optional) – If counts is True and this argument is provided, only the first min_counts rows of the pd.DataFrame.value_counts() data are plotted, with the data sorted from lowest to highest counts.

  • **kwargs

    Any other keyword arguments will be fed into the plotting function and should be arguments to the core matplotlib plotting function (e.g. bins for histograms). Arguments that do not apply to all plot kinds being used cannot currently be passed (as they are automatically filled in).

    TO DO: something to allow for keyword arguments to only be fed functions that they apply to

Returns

fig (matplotlib.figure.Figure)

axes (numpy.ndarray of numpy.ndarray of matplotlib.axes._subplots.AxesSubplot)
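To make the kind and layout rules concrete, here is a hedged sketch of how the inputs might be resolved before plotting. The helper plan_subplots and the column names are hypothetical; it assumes the default N in the N x 2 layout is the number of subplots divided by two, rounded up.

```python
import math


def plan_subplots(columns_to_plot, kind, layout=None):
    """Return one plot kind per subplot and the (rows, columns) layout."""
    n_plots = len(columns_to_plot)
    # A single string applies to every subplot; a list must match in length.
    kinds = [kind] * n_plots if isinstance(kind, str) else list(kind)
    if len(kinds) != n_plots:
        raise ValueError("kind must be a str or match columns_to_plot in length")
    if layout is None:
        layout = (math.ceil(n_plots / 2), 2)  # assumed default: N x 2
    return kinds, layout


# Mix of univariate and bivariate entries, so kind is given per subplot.
columns = ["age", ["height", "weight"], "income"]
kinds, layout = plan_subplots(columns, ["hist", "scatter", "hist"])
```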

to_labels(columns)[source]

Convert column names to figure labels.

Converts a list of column names into labels based on the provided dictionary column_labels or, if it is not provided, on the labelfy function, which replaces underscores with spaces and capitalizes the first letter of the string.

Parameters

columns (list of str) – List of column names to convert to labels.

Returns

labels (list of str) – List of labels corresponding to the provided column names.
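The default labeling rule can be sketched as follows. Both functions are hypothetical stand-ins rather than the vishelper implementations, and falling back to the default rule for any column missing from column_labels is an assumption.

```python
def default_label(column_name):
    """Replace underscores with spaces and capitalize the first letter."""
    text = column_name.replace("_", " ")
    return text[:1].upper() + text[1:]


def columns_to_labels(columns, column_labels=None):
    """Look each column up in column_labels, else apply the default rule."""
    column_labels = column_labels or {}
    return [column_labels.get(c, default_label(c)) for c in columns]


labels = columns_to_labels(
    ["path_length", "n_visits"], column_labels={"n_visits": "Number of visits"}
)
```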

vishelper.dfplot.frac_outside_threshold(group, thresholds, how='above', exclude=None)[source]

Helper function for computing the

vishelper.helpers module

vishelper.helpers.get_ax_fig(ax, figsize=None, kwargs=None)[source]
vishelper.helpers.get_formats(kwargs, *attributes)[source]
vishelper.helpers.listify(l, multiplier=1, order=1)[source]

Embeds a list in a list or replicates a list to meet shape requirements.

vishelper.helpers.parse_df(x, y, df, labels=None, color_by=None)[source]

vishelper.interactive module

vishelper.interactive.interactive_heatmap(df, save_path, ycolumn='dayofweek', xcolumn='weekof', value_column='value', x_range=None, y_range=None, colors=None, vmin=None, vmax=None, bokehtools='hover,save,pan,box_zoom,reset,wheel_zoom', title='', plot_width=900, plot_height=500, min_border_right=0, colorbar_format='%d lbs', x_axis_location='above', y_axis_location='left', toolbar_location='below', colorbar_orientation='vertical', colorbar_place='right', tooltips=None, label_font_size='10pt', xlabel_orientation=None, colorbar_label_standoff=20, colorbar_major_label_text_align='center', xlabel='', ylabel='')[source]

Creates an interactive heatmap with tooltips

Parameters
  • df

  • save_path (str) – Where to save the output

  • ycolumn (str) – Column of the dataframe whose values determine the row of the heatmap (default: ‘dayofweek’)

  • xcolumn (str) – Column of the dataframe whose values determine the column of the heatmap (default: ‘weekof’)

  • value_column (str) – Column of the dataframe whose values color the cell at the intersection of each row and column.

  • x_range (list or similar) – The possible column values (e.g. Week of Jan 1, Week of Jan 8, …). Defaults to the unique set of values in the xcolumn

  • y_range (list or similar) – The possible row values (e.g. Monday, Tuesday, …). Defaults to the unique set of values in the ycolumn

  • colors – Color scale to use. Defaults to palettable.colorbrewer.sequential.BuGn_9.hex_colors

  • vmin

  • vmax

  • bokehtools

  • title

  • plot_width

  • plot_height

  • min_border_right (int) – Minimum border left between right side of image and border of figure. Default 0. It is recommended to change to ~80 when setting colorbar_orientation to horizontal to allow room for x-axis labels which are oriented pi/3

  • colorbar_format

  • x_axis_location – which side to put the x-axis (column) labels. Default: ‘above’. Options: ‘above’, ‘below’

  • y_axis_location – which side to put the y-axis (row) labels. Default: ‘left’. Options: ‘left’, ‘right’

  • colorbar_orientation (str) – How to orient the colorbar, ‘vertical’ or ‘horizontal’. Default: ‘vertical’

  • colorbar_place (str, optional) – where to add the colorbar (default: ‘right’) Valid places are: ‘left’, ‘right’, ‘above’, ‘below’, ‘center’.

  • toolbar_location

  • tooltips

  • label_font_size

  • xlabel_orientation (float) – Orientation of labels on x-axis. If left as None, default is pi/3

  • colorbar_label_standoff (int) – How much space to leave between colorbar and colorbar labels. Default 20. It is recommended to set to ~5 for vertical color bars.

  • colorbar_major_label_text_align (str) – How to align tick labels to ticks. Default ‘center’.

  • xlabel (str) – Label for x-axis. Default=””

  • ylabel (str) – Label for y-axis. Default=””

Returns:
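The way xcolumn, ycolumn and value_column map long-form data onto heatmap cells can be sketched without Bokeh. The function heatmap_grid and the records below are hypothetical; taking the range defaults in order of first appearance is an assumption, since the documentation only says they default to the unique values of each column.

```python
def heatmap_grid(records, xcolumn="weekof", ycolumn="dayofweek",
                 value_column="value", x_range=None, y_range=None):
    """Arrange long-form records into rows (y) by columns (x) of values."""

    def unique(column):
        seen = []
        for record in records:
            if record[column] not in seen:
                seen.append(record[column])
        return seen

    x_range = x_range or unique(xcolumn)
    y_range = y_range or unique(ycolumn)
    cells = {(r[xcolumn], r[ycolumn]): r[value_column] for r in records}
    # Missing combinations are left as None.
    grid = [[cells.get((x, y)) for x in x_range] for y in y_range]
    return x_range, y_range, grid


records = [
    {"weekof": "Jan 1", "dayofweek": "Monday", "value": 3},
    {"weekof": "Jan 1", "dayofweek": "Tuesday", "value": 5},
    {"weekof": "Jan 8", "dayofweek": "Monday", "value": 4},
]
x_range, y_range, grid = heatmap_grid(records)
```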

vishelper.plot module

vishelper.plot.plot(x=None, y=None, df=None, kind=None, plot_function=None, ax=None, xlabel=None, ylabel=None, title=None, legend=None, legend_kwargs=None, ticks=None, labels=None, color=None, color_data=None, color_by=None, figsize=None, xlim=None, ylim=None, tight_layout=False, **kwargs)[source]

Makes a plot of one or more univariate or multivariate datasets.

Options for kind of plot:

Univariate, continuous data (y=None):

  • hist (histogram)

Bivariate, continuous x continuous data:

  • scatter

  • line

Bivariate, categorical x continuous data:

  • boxplot

  • bar

  • barh (horizontal bar plot)

Multivariate, continuous data (x=None, y=None, df != None):

  • heatmap

Parameters
  • x (optional, default None) –

    Data or column name(s) to be plotted for univariate plots, along the x-axis for continuous x continuous data, or categorical data for categorical x continuous data.

    • For one dataset: list or np.array of x-data or str denoting column of df to use for x-data.

    • More than one dataset: list of lists of x-data sets or list of str with columns to be used for x-data.

    • If None, df must be provided and kind must be ‘heatmap’.

  • y (optional) –

    • Univariate: None (default).

    • For one bivariate dataset: list or np.array of y-data or str denoting column of df to use for y-data.

    • More than one bivariate dataset: list of lists of y-data sets or list of str with columns to be used for y-data.

  • kind (str, optional) –

    Type of plot. Defaults to None, which implies plot_function must be given. Options for kind:

    Univariate, continuous data (y=None):

    • hist (histogram)

    Bivariate, continuous x continuous data:

    • scatter

    • line

    Bivariate, categorical x continuous data:

    • boxplot

    • bar

    • barh (horizontal bar plot)

    Multivariate, continuous data (x=None, y=None, df != None):

    • heatmap

  • df – Dataframe containing the data to be plotted. Default None, in which case the data must be provided in x (and y if bivariate).

  • plot_function – Default None. If “kind” is not given, can provide a plot function with the form plot_function(x, y, ax, **kwargs) or plot_function(x, ax, **kwargs).

  • ax (matplotlib.axes._subplots.AxesSubplot, optional) – Matplotlib axes handle. Default is None and a new ax is generated along with the fig.

  • xlabel (str, optional) – Label for x axis. Default None.

  • ylabel (str, optional) – Label for y axis. Default None.

  • title (str, optional) – Plot title. Default None.

  • legend (bool, optional) – A legend will be displayed if legend=True, or if legend=None and more than one thing is being plotted. Default None.

  • legend_kwargs (dict, optional) – Dictionary of keyword arguments for the legend (see legend for when the legend will be displayed). Default None.

  • labels (list of str, optional) – List of labels for legend. Default None. If legend is True, and no labels are provided, labels will be “Set N” where N is the ordering of the inputs.

  • ticks (list of str, optional) – List of tick labels. Default None.

  • color (str or list of str, optional) – Color to plot. List of colors if there is more than one thing being plotted. Default is None, in which case, if color_data is also None, default colors from vishelper.config.formatting[“color.all”] will be used.

  • color_data (list of str, optional) – If provided, list should be the same length of the data, providing an individual color for each data point. Default None in which case all points will be colored according to color argument.

  • color_by (str) – Which column to color by if providing a dataframe in df

  • figsize (tuple, optional) – Figure size. Default is None and plot will be sized based on vishelper.config.formatting[‘figure.figsize’].

  • xlim (tuple, optional) – Tuple of minimum and maximum x values to plot (xmin, xmax). Default is None and matplotlib chooses these values.

  • ylim (tuple, optional) – Tuple of minimum and maximum y values to plot (ymin, ymax). Default is None and matplotlib chooses these values.

  • tight_layout (bool, optional) – Invokes plt.tight_layout() to ensure enough space is given to plot elements

  • **kwargs – These kwargs are any that are relevant to the plot kind provided.

Returns

fig (matplotlib.figure.Figure) – Matplotlib figure object

ax (matplotlib.axes._subplots.AxesSubplot) – Axes object with the plot(s)
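The legend rules described for the legend and labels parameters can be summarized in a few lines. The helper resolve_legend is hypothetical and not part of vishelper; numbering the default “Set N” labels from 1 is an assumption.

```python
def resolve_legend(legend, n_datasets, labels=None):
    """Decide whether a legend is shown and which labels it would use."""
    # Shown when requested, or by default when several datasets are plotted.
    show = legend is True or (legend is None and n_datasets > 1)
    if show and labels is None:
        # Assumed 1-based numbering for the default "Set N" labels.
        labels = ["Set %d" % (i + 1) for i in range(n_datasets)]
    return show, labels


show, labels = resolve_legend(None, 2)
```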

vishelper.plot.plotxy(x, y, ax, plot_function, plot_color, df=None, labels=None, stacked=False, grouped=False, **kwargs)[source]

vishelper.reformat module

vishelper.reformat.add_labels(ax, xlabel=None, ylabel=None, title=None, main_title=None)[source]

Adds xlabel, ylabel, title, main_title, if provided to ax with size given by the formatting dict.

vishelper.reformat.adjust_lims(ax, xlim=None, ylim=None)[source]

Adjusts the x-axis and y-axis view limits of ax if xlim and/or ylim are provided.

Parameters
  • ax (matplotlib.axes._subplots.AxesSubplot) – Matplotlib axes handle

  • xlim (tuple, optional) – Tuple of (x_min, x_max) giving the range of x-values to view in the plot. If xlim=None (default), the x-axis view limits will not be changed.

  • ylim (tuple, optional) – Tuple of (y_min, y_max) giving the range of y-values to view in the plot. If ylim=None (default), the y-axis view limits will not be changed.

Returns

ax (matplotlib.axes._subplots.AxesSubplot) – Matplotlib axes handle with adjusted xlim and ylim
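The behavior described above amounts to calling the axes’ set_xlim and set_ylim only when a limit is given. The sketch below is an illustration, not the vishelper source: adjust_lims_sketch is a hypothetical name, and RecordingAxes is a tiny stand-in for a matplotlib Axes so that the example runs without matplotlib installed.

```python
class RecordingAxes:
    """Minimal stand-in for a matplotlib Axes (illustration only)."""

    def __init__(self):
        self.xlim, self.ylim = (0.0, 1.0), (0.0, 1.0)

    def set_xlim(self, lim):
        self.xlim = tuple(lim)

    def set_ylim(self, lim):
        self.ylim = tuple(lim)


def adjust_lims_sketch(ax, xlim=None, ylim=None):
    """Apply view limits only for the axes where a limit was provided."""
    if xlim is not None:
        ax.set_xlim(xlim)
    if ylim is not None:
        ax.set_ylim(ylim)
    return ax


ax = adjust_lims_sketch(RecordingAxes(), xlim=(0, 10))
```

A real matplotlib Axes exposes the same set_xlim and set_ylim methods, so the function body would work unchanged with one.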

vishelper.reformat.decide_legend(ax, legend, plot_legend, legend_kwargs)[source]
vishelper.reformat.fake_legend(ax, legend_labels, colors, marker=None, size=None, fontsize=None, linestyle='', loc=None, bbox_to_anchor=None, where='best', **kwargs)[source]

Adds a fake legend to the plot with the provided legend labels and corresponding colors and attributes.

Parameters
  • ax (matplotlib.axes._subplots.AxesSubplot) – Matplotlib axes handle

  • legend_labels (list of str) – Labels for the legend items.

  • colors (list) – List of colors of the items in the legend.

  • marker (str or list of str, optional) – Marker for the items in the legend. Defaults to formatting[‘legend.marker’]

  • size (str or list of str, optional) – Marker size for the items in the legend. Defaults to formatting[‘markersize’]

  • where (str, optional) – Where to put the legend. Options are right, below, and best

  • linestyle (str or list of str, optional) – Line style for items in legend. Defaults to “” (no line).

  • loc (str, optional) – Location for where to place the legend. Defaults to “best”.

  • bbox_to_anchor (tuple) – Where to anchor legend, used to place legend outside plot. Default None.

  • **kwargs – Keyword arguments passed to ax.legend()

Returns

ax (matplotlib.axes._subplots.AxesSubplot) – Matplotlib axes handle with fake legend

vishelper.reformat.labelfy(labels, label_map=None, replacements=None)[source]

Module contents