# Przemyslaw Biecek

#### 30 packages on CRAN

#### 1 package on Bioconductor

Machine Learning (ML) models are widely used and have various applications in classification or regression. Models created with boosting, bagging, stacking or similar techniques are often used due to their high performance, but such black-box models usually lack interpretability. The 'DALEX' package contains various explainers that help to understand the link between input variables and model output. The single_variable() explainer extracts the conditional response of a model as a function of a single selected variable. It is a wrapper over the packages 'pdp' (Greenwell 2017) <doi:10.32614/RJ-2017-016>, 'ALEPlot' (Apley 2018) <arXiv:1612.08468> and 'factorMerger' (Sitko and Biecek 2017) <arXiv:1709.04412>. The single_prediction() explainer attributes parts of a model prediction to particular variables used in the model. It is a wrapper over the 'breakDown' package (Staniak and Biecek 2018) <doi:10.32614/RJ-2018-072>. The variable_dropout() explainer calculates variable importance scores based on variable shuffling (Fisher et al. 2018) <arXiv:1801.01489>. All these explainers can be plotted with the generic plot() function and compared across different models. 'DALEX' is a part of the 'DrWhy.AI' universe (Biecek 2018) <arXiv:1806.08915>.
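
The shuffling idea behind variable_dropout() can be sketched in a few lines. The block below is a minimal Python illustration, not the 'DALEX' implementation: the toy model, the synthetic data, and the names `permutation_importance` and `mse` are assumptions introduced only for this example.

```python
import random


def model(row):
    # Hypothetical "black box": strong effect of x1, weak effect of x2, none of x3.
    return 3.0 * row["x1"] + 0.5 * row["x2"]


def mse(rows, targets, predict):
    # Mean squared error of the predictions against the targets.
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, predict, variables, seed=0):
    # Importance of a variable = increase in loss after shuffling its column.
    rng = random.Random(seed)
    baseline = mse(rows, targets, predict)
    scores = {}
    for var in variables:
        shuffled = [r[var] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{var: v}) for r, v in zip(rows, shuffled)]
        scores[var] = mse(permuted, targets, predict) - baseline
    return scores


# Synthetic data (an assumption for the example): targets come from the model itself.
data_rng = random.Random(1)
rows = [
    {"x1": data_rng.random(), "x2": data_rng.random(), "x3": data_rng.random()}
    for _ in range(200)
]
targets = [model(r) for r in rows]
importance = permutation_importance(rows, targets, model, ["x1", "x2", "x3"])
```

A variable the model ignores (here `x3`) gets a score of zero, while shuffling an influential variable raises the loss.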

Model agnostic tool for decomposition of predictions from black boxes. Break Down Table shows contributions of every variable to a final prediction. Break Down Plot presents variable contributions in a concise graphical way. This package works for binary classifiers and general regression models.
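
As a rough illustration of what a Break Down Table contains, the sketch below attributes a prediction to variables by fixing them one at a time to the values of the explained observation and recording the change in the mean prediction. It is written in Python with a fixed, user-supplied variable order. The package's own ordering strategy is not reproduced here, and the model, data, and function names are hypothetical.

```python
def model(row):
    # Hypothetical black box with an additive term and an interaction.
    return 2.0 * row["a"] + row["b"] * row["c"]


def mean_prediction(predict, rows):
    return sum(predict(r) for r in rows) / len(rows)


def break_down(predict, data, observation, order):
    # Work on copies so that the caller's data set is left untouched.
    current = [dict(r) for r in data]
    previous = mean_prediction(predict, current)
    table = [("baseline", previous)]
    for var in order:
        # Fix this variable to the observation's value in every row.
        for r in current:
            r[var] = observation[var]
        value = mean_prediction(predict, current)
        table.append((var, value - previous))
        previous = value
    return table


data = [
    {"a": 0.0, "b": 0.0, "c": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.0},
    {"a": 2.0, "b": 2.0, "c": 2.0},
    {"a": 1.0, "b": 1.0, "c": 1.0},
]
observation = {"a": 3.0, "b": 2.0, "c": 1.0}
table = break_down(model, data, observation, ["a", "b", "c"])
```

The baseline plus all contributions adds up to the model's prediction for the observation. With interactions in the model, a different variable order can give different contributions.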

Data exploration and modelling is a process in which a lot of data artifacts are produced, such as subsets, data aggregates, plots, statistical models, different versions of data sets and different versions of results. The more projects we work with, the more artifacts are produced and the harder it is to manage them. Archivist helps to store and manage artifacts created in R. Archivist allows you to store selected artifacts as binary files together with their metadata and relations. Archivist allows you to share artifacts with others, either through a shared folder or GitHub. Archivist allows you to look for already created artifacts by their class, name, date of creation or other properties, and makes it easy to restore such artifacts. Archivist allows you to check if a new artifact is an exact copy of one that was produced some time ago. That might be useful either for testing or caching.
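
The store, search, and exact-copy workflow described above can be sketched with a content hash as the artifact key. This is a minimal Python illustration under stated assumptions, not the archivist API: the in-memory `repository`, the use of SHA-256 over a JSON serialization, and the functions `artifact_hash`, `save`, and `search` are all hypothetical.

```python
import hashlib
import json
from datetime import date

# Hypothetical in-memory repository: content hash -> stored record.
repository = {}


def artifact_hash(artifact):
    # Serialize with sorted keys so that equal content gives an equal hash.
    payload = json.dumps(artifact, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def save(artifact, name):
    # Store the artifact together with simple metadata and return its key.
    key = artifact_hash(artifact)
    repository[key] = {
        "artifact": artifact,
        "name": name,
        "type": type(artifact).__name__,
        "created": date.today().isoformat(),
    }
    return key


def search(**properties):
    # Return the keys of all records whose metadata match every given property.
    return [
        key
        for key, record in repository.items()
        if all(record.get(p) == v for p, v in properties.items())
    ]


key = save({"slope": 1.5, "n": 10}, name="fit")
same_content = artifact_hash({"n": 10, "slope": 1.5}) == key
changed_content = artifact_hash({"n": 11, "slope": 1.5}) == key
```

Here `same_content` is true and `changed_content` is false, which is the kind of check that supports testing and caching.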

Two partially supervised mixture modeling methods are implemented: soft-label and belief-based modeling. For completeness, the package is also equipped with the functionality of unsupervised, semi- and fully supervised mixture modeling. The package can also be applied to selecting the best-fitting model from a set of models with different component numbers or constraints on their structures. For a detailed introduction see: Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.

Data sets and functions used in the Polish book "Przewodnik po pakiecie R" (The Hitchhiker's Guide to the R). See more at <http://biecek.pl/R>. Among others, you will find here data about housing prices, cancer patients and running times.

Tools for accessing and processing datasets prepared by the Foundation SmarterPoland.pl. Among others: access to the APIs of Google Maps, the Central Statistical Office of Poland, MojePanstwo, Eurostat, WHO and other sources.

Ceteris Paribus Profiles (What-If Plots) are designed to present model responses around selected points in a feature space, for example around a single prediction for an interesting observation. Plots are designed to work in a model-agnostic fashion: they work for any predictive Machine Learning model and allow for model comparisons. Ceteris Paribus Plots supplement the Break Down Plots from the 'breakDown' package.
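
A what-if profile of this kind can be computed by varying one variable over a grid while keeping all other variables of the observation fixed. The Python sketch below is only an illustration of that idea. The toy model and the function name `ceteris_paribus_profile` are assumptions, not part of the package.

```python
def model(obs):
    # Hypothetical black box with an interaction between age and dose.
    return 0.1 * obs["age"] + obs["dose"] ** 2 - 0.05 * obs["age"] * obs["dose"]


def ceteris_paribus_profile(predict, observation, variable, grid):
    # Model response when only `variable` changes; everything else stays fixed.
    profile = []
    for value in grid:
        modified = dict(observation)  # copy, so the original is not altered
        modified[variable] = value
        profile.append((value, predict(modified)))
    return profile


observation = {"age": 40, "dose": 2.0}
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
profile = ceteris_paribus_profile(model, observation, "dose", grid)
```

The profile passes through the model's actual prediction at the observation's own value of the variable (here `dose = 2.0`). Because the model is only called through `predict`, the same function works for any model, which is what makes the approach model-agnostic.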

The data sets used in the online course "PogromcyDanych". You can process data in many ways. The course Data Crunchers will introduce you to this variety. For this reason we will work on datasets of different sizes (from several to several hundred thousand rows), with various levels of complexity (from two to two thousand columns) and prepared in different formats (text data, quantitative data and qualitative data). All of these data sets were gathered in a single big package called PogromcyDanych to facilitate access to them. It contains all sorts of data sets such as data about offer prices of cars, results of opinion polls, information about changes in stock market indices, data about names given to newborn babies, ski jumping results or information about outcomes of breast cancer patients' treatment.

Machine Learning models are widely used and have various applications in classification or regression tasks. Due to increasing computational power, availability of new data sources and new methods, ML models are more and more complex. Models created with techniques like boosting, bagging or neural networks are true black boxes. It is hard to trace the link between input variables and model outcomes. They are used because of high performance, but lack of interpretability is one of their weakest sides. In many applications we need to know, understand or prove how input variables are used in the model and what impact they have on the final model prediction. DALEX2 is a collection of tools that help to understand how complex predictive models work. DALEX2 is a part of the DrWhy universe of tools for Explanation, Exploration and Visualisation for Predictive Models.

Three games: proton, frequon and regression. Each one is a console-based data-crunching game for younger and older data scientists. Act as a data-hacker and find Slawomir Pietraszko's credentials to the Proton server. In proton you have to solve four data-based puzzles to find the login and password. There are many ways to solve these puzzles. You may use loops, data filtering, ordering, aggregation or other tools. Only basic knowledge of R is required to play the game, yet the more functions you know, the more approaches you can try. In frequon you will help to perform a statistical cryptanalytic attack on a corpus of ciphered messages. This time seven sub-tasks push the bar much higher. Do you accept the challenge? In regression you will test your modeling skills in a series of eight sub-tasks. Try only if ANOVA is your close friend. It's a part of the Beta and Bit project. You will find more about the Beta and Bit project at <http://betabit.wiki>.

Smooth testing of goodness of fit. These tests are data driven (the alternative hypothesis is dynamically selected based on data). In this package you will find various tests for exponential, Gaussian, Gumbel and uniform distributions.

A set of datasets and functions used in the book 'Modele liniowe i mieszane w R, wraz z przykladami w analizie danych'. Datasets either come from real studies or are created to be as similar as possible to real studies.

'The Proton Game' is a console-based data-crunching game for younger and older data scientists. Act as a data-hacker and find Slawomir Pietraszko's credentials to the Proton server. You have to solve four data-based puzzles to find the login and password. There are many ways to solve these puzzles. You may use loops, data filtering, ordering, aggregation or other tools. Only basic knowledge of R is required to play the game, yet the more functions you know, the more approaches you can try. Knowledge of dplyr is not required but may be very helpful. This game is linked with the "Pietraszko's Cave" story available at http://biecek.pl/BetaBit/Warsaw. It's a part of the Beta and Bit series. You will find more about the Beta and Bit series at http://biecek.pl/BetaBit.

Model agnostic tool for decomposition of predictions from black boxes. Supports additive attributions and attributions with interactions. The Break Down Table shows contributions of every variable to a final prediction. The Break Down Plot presents variable contributions in a concise graphical way. This package works for classification and regression models. It is an extension of the 'breakDown' package (Staniak and Biecek 2018) <doi:10.32614/RJ-2018-072>, with new and faster strategies for orderings. It supports interactions in explanations and has interactive visuals (implemented with the 'D3.js' library). The underlying methodology is described in the 'iBreakDown' article (Gosiewska and Biecek 2019) <arXiv:1903.11420>. This package is a part of the 'DrWhy.AI' universe (Biecek 2018) <arXiv:1806.08915>.

Collection of tools for assessment of feature importance and feature effects. Key functions are: feature_importance() for assessment of global level feature importance, ceteris_paribus() for calculation of the what-if plots, partial_dependency() for partial dependency plots, conditional_dependency() for conditional dependency plots, accumulated_dependency() for accumulated local effects plots, aggregate_profiles() and cluster_profiles() for aggregation of ceteris paribus profiles, theme_drwhy() with a 'ggplot2' skin for all plots, generic print() and plot() for better usability of selected explainers. The package 'ingredients' is a part of the 'DrWhy.AI' universe (Biecek 2018) <arXiv:1806.08915>.
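
The link between what-if profiles and partial dependency plots can be shown with a small sketch: a partial dependence curve is the average of individual ceteris paribus profiles over a data set. The Python code below illustrates this under stated assumptions. The toy model, the data, and the function name `partial_dependence` are hypothetical and do not reproduce the package's functions.

```python
def model(row):
    # Hypothetical black box: quadratic in x, linear in z.
    return row["x"] ** 2 + row["z"]


def partial_dependence(predict, data, variable, grid):
    # For each grid value, set `variable` to it in every row and average the predictions.
    curve = []
    for value in grid:
        predictions = [predict(dict(row, **{variable: value})) for row in data]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


data = [{"x": 0.0, "z": 1.0}, {"x": 1.0, "z": 3.0}]
curve = partial_dependence(model, data, "x", [0.0, 1.0, 2.0])
```

For this toy model the curve equals the squared grid value plus the mean of `z` in the data.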

Contains the function 'ggsurvplot()' for easily drawing beautiful and 'ready-to-publish' survival curves with the 'number at risk' table and 'censoring count plot'. Other functions are also available to plot adjusted curves for the 'Cox' model and to visually examine 'Cox' model assumptions.

Provides tools for importing, merging, and analysing data from international assessment studies (TIMSS, PIRLS, PISA, ICILS, and PIAAC).

The Merging Path Plot is a methodology for adaptive fusing of k-groups with likelihood-based model selection. This package contains tools for exploration and visualization of k-group dissimilarities. Comparison of k-groups is one of the most important issues in exploratory analyses and it has zillions of applications. The traditional approach is to use pairwise post hoc tests in order to verify which groups differ significantly. However, with a large number of groups this approach fails in both the interpretation and the visualization layers. The Merging Path Plot solves this problem by using an easy-to-understand description of dissimilarity among groups based on the Likelihood Ratio Test (LRT) statistic. Work on this package was financially supported by the 'NCN Opus grant 2016/21/B/ST6/02176'.

Tools to download data from the Eurostat database <http://ec.europa.eu/eurostat> together with search and manipulation utilities.

A set of tools to help explain which variables are most important in a random forest. Various variable importance measures are calculated and visualized in different settings in order to get an idea of how their importance changes depending on our criteria (Hemant Ishwaran and Udaya B. Kogalur and Eiran Z. Gorodeski and Andy J. Minn and Michael S. Lauer (2010) <doi:10.1198/jasa.2009.tm08622>, Leo Breiman (2001) <doi:10.1023/A:1010933404324>).

Provides an easy to use unified interface for creating validation plots for any model. The 'auditor' helps to avoid repetitive work consisting of writing code needed to create residual plots. These visualizations allow you to assess and compare the goodness of fit, performance, and similarity of models.

Interpretability of complex machine learning models is a growing concern. This package helps to understand key factors that drive the decisions made by a complicated predictive model (a so-called black-box model). This is achieved through local approximations that are based either on an additive regression-like model or on a CART-like model that allows for higher interactions. The methodology is based on Tulio Ribeiro, Singh, Guestrin (2016) <doi:10.1145/2939672.2939778>. More details can be found in Staniak, Biecek (2018) <arXiv:1804.01955>.

An extension of the 'archivist' package that integrates it with GitHub via the GitHub API, the 'git2r' package and the 'httr' package.

Tool for analyzing competing risks models. The main point of interest is testing differences between groups (as described in R.J Gray (1988) <doi:10.1214/aos/1176350951> and J.P. Fine, R.J Gray (1999) <doi:10.2307/2670170>) and visualizations of survival and cumulative incidence curves.

Estimates coefficients of the Cox proportional hazards model using a stochastic gradient descent algorithm for batch data.

Provides SHAP explanations of machine learning models. In applied machine learning, there is a strong belief that we need to strike a balance between interpretability and accuracy. However, in the field of Interpretable Machine Learning, there are more and more new ideas for explaining black-box models. One of the best known methods for local explanations is SHapley Additive exPlanations (SHAP), introduced by Lundberg, S., et al., (2016) <arXiv:1705.07874>. The SHAP method is used to calculate the influence of variables on a particular observation. This method is based on Shapley values, a technique used in game theory. The R package 'shapper' is a port of the Python library 'shap'.
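
To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a tiny model by averaging each variable's marginal contribution over all variable orderings. It is a plain Python illustration that does not use the 'shap' library or its approximation algorithms. The toy model, the background data, and the function names are assumptions, and the brute-force enumeration is only practical for a handful of variables.

```python
from itertools import permutations


def model(row):
    # Hypothetical black box with an additive term and an interaction.
    return 2.0 * row["a"] + row["b"] * row["c"]


def coalition_value(predict, data, observation, known):
    # Mean prediction when variables in `known` are fixed to the observation's
    # values and the remaining ones are taken from the background data.
    total = 0.0
    for row in data:
        mixed = {k: (observation[k] if k in known else v) for k, v in row.items()}
        total += predict(mixed)
    return total / len(data)


def shapley_values(predict, data, observation):
    variables = list(observation)
    orderings = list(permutations(variables))
    phi = {v: 0.0 for v in variables}
    for order in orderings:
        known = set()
        previous = coalition_value(predict, data, observation, known)
        for var in order:
            known.add(var)
            current = coalition_value(predict, data, observation, known)
            # Marginal contribution of `var` in this ordering, averaged over orderings.
            phi[var] += (current - previous) / len(orderings)
            previous = current
    return phi


data = [
    {"a": 0.0, "b": 0.0, "c": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.0},
    {"a": 2.0, "b": 2.0, "c": 2.0},
    {"a": 1.0, "b": 1.0, "c": 1.0},
]
observation = {"a": 3.0, "b": 2.0, "c": 1.0}
phi = shapley_values(model, data, observation)
baseline = coalition_value(model, data, observation, set())
```

The values in `phi` sum to the difference between the prediction for the observation and the mean prediction over the background data, which is the additivity property that gives SHAP its name.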

Set of functions that access information about deputies and votings in the Polish Diet (Sejm) from the webpage <http://www.sejm.gov.pl/>. The package was developed as a result of an internship in MI2 Group - <http://mi2.mini.pw.edu.pl/>, Faculty of Mathematics and Information Science, Warsaw University of Technology.

Builds a generalized linear model with automatic data transformation. The 'xspliner' helps to build simple, interpretable models that inherit information provided by more complicated ones. The resulting model may be treated as an explanation of the black box that was supplied to the algorithm.

Survival models may have very different structures. This package contains functions for creating a unified representation of survival models, which can be further processed by various survival explainers. Tools implemented in 'survxai' help to understand how input variables are used in the model and what impact they have on the final model prediction. Currently, four explanation methods are implemented. We can divide them into two groups: local and global.

Local explanations of machine learning models describe how features contributed to a single prediction. This package implements an explanation method based on LIME (Local Interpretable Model-agnostic Explanations, see Tulio Ribeiro, Singh, Guestrin (2016) <doi:10.1145/2939672.2939778>) in which interpretable inputs are created based on local rather than global behaviour of each original feature.

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes. The key is to understand genomics to improve cancer care. The RTCGA package offers download and integration of the variety and volume of TCGA data using the patient barcode key, which makes the data easier to obtain. This may have a beneficial influence on the development of science and the improvement of patients' treatment. Furthermore, the RTCGA package transforms TCGA data to a tidy form which is convenient to use.