Nonparametric Tests and Why I Am Wary of Using Them

Published: May 3, 2020

If you follow online communities where people ask for statistics advice, you have probably seen a type of question pop up from time to time: “I am doing an analysis using parametric test X; however, (assumption check) test A reveals that assumption B is violated. What should I do?” (As you may guess, the violated assumption is usually the normality assumption.) And people start recommending a number of textbook solutions. One such solution is to stop using the parametric test and switch to a nonparametric equivalent. For example, if you use a \(t\) test and violate the normality assumption, some would say that you should use the Mann-Whitney U test instead. And that’s it.
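To make the recipe concrete, here is a minimal sketch of the decision rule being recommended, written in Python on simulated data. The Shapiro-Wilk check, the 0.05 cutoff, and the skewed samples are illustrative assumptions of mine, not a recommendation:

```python
# The "recipe": run an assumption check, then switch tests based on the result.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)
group_a = rng.exponential(scale=2.0, size=40)  # deliberately skewed data
group_b = rng.exponential(scale=2.5, size=40)

# Step 1: assumption check (Shapiro-Wilk normality test on each group)
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

# Step 2: binary decision between the parametric and nonparametric test
if normal_a and normal_b:
    result = stats.ttest_ind(group_a, group_b)
else:
    result = stats.mannwhitneyu(group_a, group_b)

print(result)
```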

I think these questions are problematic for two reasons. First, they ask for recipe-like solutions, which can be justified by a few references to the extant literature, without any thought about the specific issues related to the analysis at hand. Second, they encourage binary thinking: “If X is violated, do Y; if not, do Z.” It would be better to think about the possible consequences of such violations and act accordingly. Otherwise, you might be choosing between two faulty approaches.

I am not against using nonparametric tests per se (they may even have desirable properties); what I object to is the way decisions about these tests are made. In my case, the foremost reason I hesitate to use nonparametric tests is that they do not generalize well to complex models. Using a nonparametric test to compare two groups is not an issue. But what happens if you want to estimate cross-level interactions and random effects? As this answer on Cross Validated indicates, we can think of parametric assumptions as simplifying heuristics:

More generally, parametric procedures provide a way of imposing structure on otherwise unstructured problems. This is very useful and can be viewed as a kind of simplifying heuristic rather than a belief that the model is literally true. Take for instance the problem of predicting a continuous response \(y\) based on a vector of predictors \(x\) using some regression function \(f\) (even assuming that such a function exists is a kind of parametric restriction). If we assume absolutely nothing about \(f\) then it’s not at all clear how we might proceed in estimating this function. The set of possible answers that we need to search is just too large. But if we restrict the space of possible answers to (for instance) the set of linear functions \(f(x)=\sum_{j=1}^p \beta_j x_j\) then we can actually start making progress. We don’t need to believe that the model holds exactly, we are just making an approximation due to the need to arrive at some answer, however imperfect.

Thus, such a simplifying heuristic (e.g., specifying a regression function) would enable one to estimate the parameters of a complex model.1
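To illustrate what I mean by a complex model, here is a hypothetical sketch, using statsmodels on simulated data, of a multilevel model with a cross-level interaction and a random slope; the variable names and effect sizes are invented for the example:

```python
# A hypothetical multilevel model: y regressed on a level-1 predictor x,
# a level-2 predictor z, and their cross-level interaction, with a random
# intercept and a random slope for x across groups. Simulated data only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2020)
n_groups, n_per_group = 30, 20
group = np.repeat(np.arange(n_groups), n_per_group)

x = rng.normal(size=n_groups * n_per_group)                         # level-1 predictor
z = np.repeat(rng.normal(size=n_groups), n_per_group)               # level-2 predictor
u0 = np.repeat(rng.normal(scale=0.5, size=n_groups), n_per_group)   # random intercepts
u1 = np.repeat(rng.normal(scale=0.3, size=n_groups), n_per_group)   # random slopes

y = 0.5 * x + 0.3 * z + 0.4 * x * z + u0 + u1 * x + rng.normal(size=len(x))
df = pd.DataFrame({"y": y, "x": x, "z": z, "group": group})

# Random intercept and random slope for x; fixed effects include x, z, and x:z
model = smf.mixedlm("y ~ x * z", df, groups=df["group"], re_formula="~x")
print(model.fit().summary())
```

A rank-based two-sample test has no obvious counterpart for a model like this; the parametric structure is exactly what makes estimation tractable.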

  • 1 In his Regression Modeling Strategies (2015, 359), Harrell argues that semiparametric models (e.g., proportional odds logistic regression) have many advantages such as “robustness and freedom from all distributional assumptions for Y conditional on any given set of predictors.” I would prefer such an approach if needed.
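As a hypothetical illustration of that semiparametric approach, here is a sketch of a proportional odds logistic regression fit with statsmodels on simulated data; the outcome categories and the single predictor are made up for the example:

```python
# A hypothetical proportional odds (ordinal logistic) regression: an ordered
# outcome with four categories modeled as a function of one predictor.
# The data are simulated purely for illustration.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2020)
x = rng.normal(size=300)
latent = 0.8 * x + rng.logistic(size=300)            # latent scale with logistic noise
y = pd.Series(pd.cut(latent, bins=[-np.inf, -1.0, 0.0, 1.0, np.inf],
                     labels=["low", "mid", "high", "top"], ordered=True))

# distr="logit" gives the proportional odds model; no intercept column is
# passed because the model estimates the category thresholds itself.
model = OrderedModel(y, x.reshape(-1, 1), distr="logit")
print(model.fit(method="bfgs", disp=False).summary())
```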

  • References

    Harrell, Frank E. 2015. Regression Modeling Strategies. Springer International Publishing.
