Issues with ff.create_distplot()

I have a number of issues with ff.create_distplot():

1. Consider the following data:

import numpy as np

import plotly.graph_objs as go
import plotly.figure_factory as ff
m = np.random.normal(loc=0.08, scale=0.0008, size=5000)

Histogram of the data:

fig = go.FigureWidget()
fig.add_histogram(x=m)
fig

However, when I try to produce a density plot using the figure factory, it does not produce what I want:

hist_data = [m]

group_labels = ['m1']
colors = ['#333F44']

# Create distplot
fig = go.FigureWidget(ff.create_distplot(hist_data, group_labels, show_hist=True, colors=colors))
fig.layout.update(title='Density curve')
fig

I can perhaps tinker with it until it gives me the right plot, but I think there is an issue there.

If I set show_hist=False, the plot looks much better:

The problem seems to be with the bins of the histogram. If we set scale=0.08 we can see that the histogram is displayed only in one bin:
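
A likely cause, going by the create_distplot signature in recent plotly.py releases, is the bin_size argument, which defaults to 1.0 and is far wider than the whole range of this data. Below is a minimal sketch that derives a bin width from the data with the Freedman-Diaconis rule (chosen here only for illustration; the figure factory does not do this itself) and passes it in. The helper names freedman_diaconis_bin_size and distplot_with_bin_size are hypothetical.

```python
import numpy as np


def freedman_diaconis_bin_size(data):
    """Return the Freedman-Diaconis bin width: 2 * IQR / n^(1/3)."""
    data = np.asarray(data, dtype=float)
    q25, q75 = np.percentile(data, [25, 75])
    return float(2.0 * (q75 - q25) / len(data) ** (1.0 / 3.0))


def distplot_with_bin_size(data, label="m1"):
    """Build a distplot whose histogram bins are sized from the data."""
    # Imported here so the bin-size helper above can be used without plotly.
    import plotly.figure_factory as ff

    return ff.create_distplot(
        [data], [label], bin_size=freedman_diaconis_bin_size(data)
    )


# Same kind of data as in the post, with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
m = rng.normal(loc=0.08, scale=0.0008, size=5000)

bin_size = freedman_diaconis_bin_size(m)
print(bin_size)
```

Calling distplot_with_bin_size(m) should then give a histogram with many bins instead of one; for this data the rule yields a width on the order of 0.0001.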


2. Even though the histnorm is set to probability density by default, I did not manage to make it look like a probability density. It looks more like a frequency “distplot”.


3. The curve_type is set to kde. What kind of KDE is being used? I would like to try the Epanechnikov kernel, for instance.
Is the kde curve type meant to produce something like the density function in R?


4. When several distplots are combined, e.g.:

hist_data = [m, m+0.001]

group_labels = ['m1', 'm2']
colors = ['#333F44', '#37AA9C']

# Create distplot
fig = go.FigureWidget(ff.create_distplot(hist_data, group_labels, show_hist=False, colors=colors))
fig.layout.update(title='Density curve')
fig

The rug plot as well as the legend do not appear in the logical sequence.
Sure, we can set

fig.layout.update(legend=dict(traceorder='normal'))

but I think the default should be the order in which they were added.

I also think that the distance between the rug plots is disproportionately big.

Thanks for the detailed description of the issues you’re having with distplot @ursus,

I haven’t actually dug into how this figure factory works yet, so unfortunately I don’t have much guidance to offer at the moment.

@nicolaskruchten, what do you think about eventually adding a px.kde or px.distplot function to plotly_express (https://github.com/plotly/plotly_express) to handle the distplot use case?

@ursus, if we decide this is something that makes sense to implement in plotly_express we’ll likely direct our efforts there since plotly_express provides a much more unified and powerful API than the distplot figure factory currently does.

Thanks,
-Jon

Thank you!

plotly_express is great!
:slight_smile:

We could do a few things with px here.

  1. We could add a marginal kwarg to px.histogram to get the ability to do rug, violin and box marginals similar to what we have in px.scatter and px.density_contour
  2. We could add a ‘px.kde’ function that leverages go.Violin under the hood and uses its built-in points system to get the rug. (With an optional marginal kwarg too, why not!)
  3. We could convince the JS guys to add a KDE option to go.Histogram
  4. We could convince the JS guys to add points to go.Histogram

I have mentioned R’s density function.
Mathematica also has a similar command which is really nice: SmoothKernelDistribution.

I’ve implemented idea 1 above: px.histogram() now has a marginal option so you can add the rug there. Still no KDE option though. Toying with the idea of a new kde trace type in plotly.js at the moment… basically a blend of violin and histogram minus histfunc. Would also allow for smooth cumulative density functions which would be nice.

That would be extremely welcome, as I have been experiencing issues with ff.create_distplot() as well here.

Could scipy help out in providing robust default kde solvers?

Yes, but this would run against one of the most important Plotly Express design goals, which is to do as little work in Python as possible, deferring to the JS layer for almost everything. Notable exceptions include the OLS and LOWESS trendlines, but that’s mostly because the JS layer doesn’t support those. In the case of KDE, the JS layer already has an implementation in the violin trace type so Plotly Express uses that instead so as not to duplicate work.

I forgot that there are also other packages :smiley:

SciPy is indeed the package used under the hood in ff.create_distplot(). However, it does not allow for other kernel types.
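
To make concrete what a different kernel type means, here is a hand-rolled sketch of a one-dimensional KDE with an Epanechnikov kernel, K(u) = 0.75 * (1 - u^2) for |u| <= 1, written in plain NumPy. It is not how SciPy or Plotly compute their curves; the function name epanechnikov_kde is hypothetical, and the bandwidth reuses the Gaussian rule-of-thumb formula from this thread purely as an assumption.

```python
import numpy as np


def epanechnikov_kde(samples, grid, bandwidth):
    """Evaluate f(x) = 1/(n*h) * sum_i K((x - x_i) / h) on a grid of x values."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every sample.
    u = (grid[:, None] - samples[None, :]) / bandwidth
    # Epanechnikov kernel: a parabola on [-1, 1], zero elsewhere.
    kernel = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)


rng = np.random.default_rng(0)
m = rng.normal(loc=0.08, scale=0.0008, size=5000)

# Rule-of-thumb bandwidth (an assumption; tuned for Gaussian kernels).
h = (4 * np.std(m) ** 5 / (3 * len(m))) ** (1 / 5)

# Pad the grid by one bandwidth so the estimate falls to zero at both ends.
grid = np.linspace(m.min() - h, m.max() + h, 400)
density = epanechnikov_kde(m, grid, h)

# Trapezoidal integral of the estimate; it should be close to 1.
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))
print(area)
```

The resulting grid and density arrays can be drawn with fig.add_scatter(x=grid, y=density) in the same way as the curves in the scikit-learn example.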

The KDE implementation in scikit-learn is richer.

For my case:

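# np, go and m are defined in the first post of this thread
from sklearn.neighbors import KernelDensity  # public import path

# X and X_plot were not defined in the original post; these definitions are an assumption.
# scikit-learn expects 2-D arrays of shape (n_samples, n_features).
X = m[:, np.newaxis]
X_plot = np.linspace(m.min(), m.max(), 1000)[:, np.newaxis]

# Rule-of-thumb bandwidth, shared by all kernels
bandwidth = (4*np.std(m)**5/(3*len(m)))**(1/5)

fig = go.FigureWidget()
for kernel in ['gaussian', 'tophat', 'epanechnikov',
               'exponential', 'linear', 'cosine']:
    kde = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X)
    log_dens = kde.score_samples(X_plot)  # log of the estimated density
    fig.add_scatter(x=X_plot[:, 0], y=np.exp(log_dens), line=dict(width=1.5), name=kernel, showlegend=True)
# Histogram with a hand-picked bin size
hist = fig.add_histogram(x=m, xbins=dict(start=m.min(), end=m.max(), size=0.0002),
                         opacity=0.6,
                         marker=dict(color='rgb(0, 0, 100)'), showlegend=False)
# Rug plot drawn on a second y-axis below the main plot
rug = fig.add_scatter(x=m, y=np.zeros(len(m)), mode='markers',
                      marker=dict(symbol='line-ns-open', color='rgb(0, 0, 100)'),
                      yaxis='y2',
                      showlegend=False
                     )

fig.layout = dict(
            xaxis1=dict(domain=[0.0, 1.0],
                        anchor='y2',
                        zeroline=False),
            yaxis1=dict(domain=[0.15, 1],
                        anchor='free',
                        zeroline=False,
                        position=0.0),
            yaxis2=dict(domain=[0, 0.15],
                        anchor='x1',
                        zeroline=False,
                        dtick=1,
                        showticklabels=False)
)
fig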

Scikit-learn requires you to set the bandwidth. Bandwidth estimation is on their # TODO list.

For the bandwidth I have used Silverman's rule of thumb, which SciPy's gaussian_kde offers as bw_method='silverman' (its default is Scott's rule).
Notice that the same value is not suitable as the bin size of the Histogram plot; I adjusted the bin size by hand to fit nicely. That might not be the right approach. The problem above has revealed that the default bin size in Histogram is not suitable for arrays with low standard deviation (often I also find it to be too large).


I’ve just found this plotly page which shows how it should be done :slight_smile:

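Also: my second objection above is invalid, as stats.norm.pdf(0, scale=0.0008) = 498.68. A probability density is not a probability, so it can be far greater than 1 when the standard deviation is small; only the area under the curve has to equal 1.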
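
A quick numerical check of that point, as a sketch using NumPy and the standard-library statistics module rather than scipy.stats. The seed and the bin count of 50 are arbitrary choices of mine.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
m = rng.normal(loc=0.08, scale=0.0008, size=5000)

# Analytic peak of the normal density: 1 / (sigma * sqrt(2 * pi)).
peak = NormalDist(mu=0.08, sigma=0.0008).pdf(0.08)

# Density-normalised histogram: bar heights times bin widths sum to 1,
# even though the individual heights are in the hundreds.
heights, edges = np.histogram(m, bins=50, density=True)
area = float(np.sum(heights * np.diff(edges)))

print(round(peak, 2))  # 498.68
print(area, heights.max())
```

So y-axis values of several hundred are exactly what histnorm='probability density' should produce for data this tightly concentrated.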

@nicolaskruchten if you consider adding a lightweight kde curve to px.histogram you may want to look at the KDEpy package.

The reported speed is orders of magnitude better.