vishelper.VisDF

class vishelper.dfplot.VisDF(df, column_labels=None, labels=None, cluster_label=None, index=None, numeric_columns=None, nonnumeric_columns=None, color_dict=None, columns_to_color=None, colors=None, univariate_ylabels=None, scale=False, pca=False)

Easily create typical visualizations of data in a pandas dataframe.

This class performs a number of plotting tasks for pandas data frames including easy sub-plotting and consistent axes labeling.

Parameters
  • df (pandas dataframe) – Data to be plotted

  • column_labels (dict) – Dictionary mapping dataframe column names to the labels to use instead for axis labels and legends when plotting. If None, underscores in column names are replaced with spaces and the first letter is capitalized (e.g. “path_length” –> “Path length”)

  • cluster_label (str, optional) – If it exists, the name of the column that gives cluster or group assignments (and is not a feature).

  • index (str or list of str, optional) – Name of identifying column such as customer id or transaction id or other column that should not be analyzed.

  • numeric_columns (list of str, optional) – List of column names corresponding to numeric fields. If not provided, this list will be assessed by data type of each column and will exclude the index and cluster label, if given.

  • nonnumeric_columns (list of str, optional) – List of column names corresponding to non-numeric fields. If not provided, this list will be assessed by data type of each column and will exclude the index and cluster label, if given.

  • color_dict (dict) – Optional. Keys correspond to categorical columns where consistent coloring by category is desired. Each key has a dictionary as its value with the category names and corresponding colors to use. Dictionary structure is: {‘column_name’: {‘category1’: ‘#colorx’, ‘category2’: ‘#colory’}}

  • columns_to_color (list of str, optional) – List of names of categorical columns to apply consistent coloring to. Colors to assign to each category will be provided by the colors attribute.

  • colors (list) – List of colors to cycle through in plotting (if None provided, will use defaults defined in config file).

  • univariate_ylabels (dict) – Dictionary of univariate plot types and corresponding y-labels to use. Default is dict(hist=’Count’, barh=’Count’).

  • scale (bool) – Default False. If True, scales the numeric columns of the dataframe and stores them in the scaled attribute.

  • pca (bool) – Default False. If True, calculates the principal components of the numeric data and stores them in the pca attribute.

add_color_dict(column_name, colors=None)

Assign colors to categories within a defined column of the data.

Parameters
  • column_name (str) – Name of categorical column in the data.

  • colors (list of str) – Optional. Colors to be assigned to the categories. If not provided, will use self.colors.

Returns: Nothing
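
The nested structure documented for color_dict can be sketched in plain Python. The helper below is hypothetical and is not the library's implementation: it assumes categories are paired with colors in sorted order and that the color list is cycled when there are more categories than colors.

```python
from itertools import cycle


def build_color_dict(column_name, categories, colors):
    """Hypothetical helper: map each category to a color, cycling if needed.

    The result is shaped like the documented color_dict:
    {'column_name': {'category1': '#colorx', 'category2': '#colory'}}
    """
    # Assumption: categories are assigned colors in sorted order.
    ordered = sorted(set(categories))
    # zip stops once every category has a color; cycle repeats the palette.
    return {column_name: dict(zip(ordered, cycle(colors)))}


# Illustrative column name, categories and palette.
segment_colors = build_color_dict(
    "segment",
    ["retail", "wholesale", "online", "retail"],
    ["#1f77b4", "#ff7f0e"],
)
print(segment_colors)
```

Because the mapping is fixed per category, the same category keeps the same color across every plot that uses the dictionary.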

category_heatmap(category, variables=None, transpose=False, measure=<function mean>, metric='actual', category_dict=None, xlabel=None, ylabel=None, cat_order=None, log10=False, **kwargs)
compare_categories(category, variables=None, measure=<function mean>)
dict_to_colors(column_name, df=None)
fscore_by_feature(category_column)

Prioritize categorical x continuous interactions to investigate.

The F-test in one-way analysis of variance is used to assess whether the expected values of the variable within the categories in the category column differ from each other. A higher f-value or lower p-value indicates a bigger difference between the categories in a given variable.

This is not meant to be used to make any statistically validated claims about the variables, as no assumptions have been checked. Moreover, no adjustment has been made for performing multiple tests. This should only be used as a directional signal of which variables may interact most with the categorical column provided and which should be prioritized for visual investigation.

Parameters

category_column – Name of the column to group observations by; the distributions of the variables are compared across its groups.

Returns

fps – Dataframe of variables and the corresponding f-score and p-value.

Return type

pandas.core.frame.DataFrame
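
To make the ranking idea concrete, here is a minimal pure-Python sketch of the one-way ANOVA F statistic, applied to two made-up variables split by a two-level category. The function name, variable names, and data are all hypothetical, and this is not the library's code. The p-value is left out because it needs the F distribution; a routine such as scipy.stats.f_oneway returns both values.

```python
from statistics import mean


def one_way_f(groups):
    """F statistic of a one-way ANOVA for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = mean(x for g in groups for x in g)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical data: each variable's values, split by category "a" / "b".
values_by_category = {
    "path_length": {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
    "duration": {"a": [1.0, 2.0, 3.0], "b": [2.0, 3.0, 4.0]},
}
f_scores = {
    variable: one_way_f(list(groups.values()))
    for variable, groups in values_by_category.items()
}
# Variables with larger F differ more across categories; look at those first.
ranked = sorted(f_scores, key=f_scores.get, reverse=True)
print(ranked, f_scores)
```

The ordering is the useful output here; as the caveat above says, the scores themselves should not be read as validated test results.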

labeled_scatter(category=None, x=None, y=None, pca=True, **kwargs)

Method for visualizing clusters in 2D.

pca

Pandas dataframe of the principal components of the numeric columns of the data.

percent_above_below(category, quantile_threshold=0.5, how='above', exclude=None, category_dict=None, transpose=False, cluster=True, variables=None, xlabel=' ', ylabel=' ', cat_order=None, **kwargs)
Parameters
  • category

  • quantile_threshold

  • how

  • exclude

  • category_dict

  • transpose

  • cluster

  • variables

  • xlabel

  • ylabel

  • cat_order

  • **kwargs

Returns:

scaled

Pandas dataframe of the scaled numeric data.

subplots(columns_to_plot, kind, color_by=None, sort_by=None, ascending=False, layout=None, titles=None, main_title=None, xlim=None, ylim=None, legend_labels=None, legend_order=None, figsize=(16, 10), counts=False, top_counts=None, min_counts=None, **kwargs)

Create figure with many subplots at once from dataframe.

This method will create a figure with subplots based on the column names inputted.

Parameters
  • columns_to_plot (list of str or lists) –

    The column(s) to plot in each figure.

    Univariate only: If only plotting univariate plots, columns_to_plot will look like [‘column1’, ‘column2’, …, ‘columnN’] where ‘column1’ will be plotted in figure 1, ‘column2’ in figure 2, etc.

    Bivariate only: If only plotting bivariate plots, columns_to_plot will look like [[‘columnx1’, ‘columny1’], [‘columnx2’, ‘columny2’], …] where ‘columnx1’ will be plotted vs ‘columny1’ in figure 1.

    Mix of univariate and bivariate: If plotting a mix of plot types, columns_to_plot will look something like: [‘column1’, [‘columnx2’, ‘columny2’], ‘column3’, …]. Note that this requires a mix of plot types, and **kwargs that do not work in all of the plot functions used currently cannot be provided.

  • kind (str or list of str) – Type of plot to draw. If a string, that plot type is used for every subplot. If a list, it should be the same length as columns_to_plot and give the plot type for each subplot.

  • layout (tuple, optional) – Number of rows x number of columns. If not given, the layout defaults to N x 2, where N is calculated from the length of columns_to_plot.

  • main_title (str, optional) – Title for the entire figure.

  • titles (list of str, optional) – List of titles for each subplot, corresponding to the order of columns_to_plot.

  • counts (bool) – If True, the output of pd.DataFrame.value_counts() is plotted rather than the dataframe data. This is typically used for categorical fields.

  • top_counts (int, optional) – If counts is True and this argument is provided, only the first top_counts rows of the pd.DataFrame.value_counts() data are plotted, with the data sorted from highest to lowest counts.

  • min_counts (int, optional) – If counts is True and this argument is provided, only the first min_counts rows of the pd.DataFrame.value_counts() data are plotted, with the data sorted from lowest to highest counts.

  • **kwargs

    Any other keyword arguments are passed to the plotting function and should be arguments to the core matplotlib plotting function (e.g. bins for histograms). Arguments that do not apply to all plot kinds being used currently cannot be passed (as they are automatically filled in).

    TO DO: something to allow for keyword arguments to only be fed functions that they apply to

Returns

  • fig – matplotlib.figure.Figure

  • axes – numpy.ndarray of numpy.ndarray of matplotlib.axes._subplots.AxesSubplot
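
The way kind and layout relate to columns_to_plot can be sketched without any plotting. Both helpers below are hypothetical and are not part of vishelper. The row count assumes that "N x 2" means just enough rows to hold every subplot, and the 'scatter' kind name is an assumption used for illustration.

```python
import math


def resolve_kinds(columns_to_plot, kind):
    """Broadcast a single plot kind, or check that a list of kinds lines up."""
    if isinstance(kind, str):
        return [kind] * len(columns_to_plot)
    if len(kind) != len(columns_to_plot):
        raise ValueError("kind needs one entry per entry in columns_to_plot")
    return list(kind)


def default_layout(columns_to_plot, ncols=2):
    """Assumed N x 2 default: the fewest rows that fit every subplot."""
    return (math.ceil(len(columns_to_plot) / ncols), ncols)


# A mixed request: two univariate entries and one bivariate [x, y] pair.
columns_to_plot = ["column1", ["columnx2", "columny2"], "column3"]
kinds = resolve_kinds(columns_to_plot, ["hist", "scatter", "barh"])
layout = default_layout(columns_to_plot)
# Strings are univariate entries; lists are bivariate entries.
arity = [
    "univariate" if isinstance(entry, str) else "bivariate"
    for entry in columns_to_plot
]
print(kinds, layout, arity)
```

With three entries the assumed default is a 2 x 2 grid, which leaves one cell unused.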

to_labels(columns)

Convert column names to figure labels.

Converts a list of column names into labels based on the provided dictionary column_labels or, if it is not provided, on the labelfy function, which replaces underscores with spaces and capitalizes the first letter of the string.

Parameters

columns (list of str) – List of column names to convert to labels.

Returns

labels – List of labels corresponding to the provided column names.

Return type

list of str
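
The conversion described above can be sketched as follows. This is a stand-alone approximation, not the library's source. It assumes that a column missing from column_labels falls back to the default conversion, which the description does not state explicitly.

```python
def labelfy(column_name):
    """Sketch of the default: underscores become spaces, first letter upper-cased."""
    text = column_name.replace("_", " ")
    return text[:1].upper() + text[1:]


def to_labels(columns, column_labels=None):
    """Look each column up in column_labels, else apply the default conversion."""
    column_labels = column_labels or {}
    # Assumption: columns absent from the dictionary use labelfy.
    return [column_labels.get(column, labelfy(column)) for column in columns]


labels = to_labels(
    ["path_length", "n_visits"],
    column_labels={"n_visits": "Number of visits"},
)
print(labels)
```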