# Przemyslaw Biecek

#### 35 packages on CRAN

#### 1 package on GitHub

#### 1 package on Bioconductor

Data exploration and modelling is a process that produces many data artifacts: subsets, data aggregates, plots, statistical models, different versions of data sets and different versions of results. The more projects we work on, the more artifacts are produced and the harder they are to manage. Archivist helps to store and manage artifacts created in R. It allows you to store selected artifacts as binary files together with their metadata and relations, and to share them with others, either through a shared folder or GitHub. It allows you to search for previously created artifacts by class, name, creation date or other properties, and makes it easy to restore them. It also allows you to check whether a new artifact is an exact copy of one produced some time ago, which might be useful for either testing or caching.
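The idea of storing artifacts with metadata and later checking for an exact copy can be illustrated with a content hash. The following is a minimal, language-neutral sketch in Python, not the package's R interface; the class `ArtifactStore`, its method names and the choice of MD5 over pickled bytes are all assumptions made for illustration.

```python
import hashlib
import pickle


class ArtifactStore:
    """Hypothetical in-memory store keyed by a hash of the serialized object."""

    def __init__(self):
        self._blobs = {}     # hash -> serialized artifact
        self._metadata = {}  # hash -> descriptive tags

    @staticmethod
    def _hash(obj):
        # Serialize the object and hash the bytes (assumed scheme for this sketch).
        return hashlib.md5(pickle.dumps(obj, protocol=4)).hexdigest()

    def save(self, obj, **tags):
        key = self._hash(obj)
        self._blobs[key] = pickle.dumps(obj, protocol=4)
        self._metadata[key] = {"class": type(obj).__name__, **tags}
        return key

    def load(self, key):
        return pickle.loads(self._blobs[key])

    def search(self, **tags):
        # Return the keys of artifacts whose metadata match every given tag.
        return [key for key, meta in self._metadata.items()
                if all(meta.get(name) == value for name, value in tags.items())]

    def contains_copy_of(self, obj):
        # An identical serialization yields an identical hash.
        return self._hash(obj) in self._blobs


store = ArtifactStore()
key = store.save([1, 2, 3], name="subset")
```

Because the key is derived from the content, saving the same object twice does not create a second entry, which is what makes the same mechanism usable for caching.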

Three games: proton, frequon and regression. Each one is a console-based data-crunching game for younger and older data scientists. Act as a data-hacker and find Slawomir Pietraszko's credentials to the Proton server. In proton you have to solve four data-based puzzles to find the login and password. There are many ways to solve these puzzles. You may use loops, data filtering, ordering, aggregation or other tools. Only basic knowledge of R is required to play the game, yet the more functions you know, the more approaches you can try. In frequon you will help to perform a statistical cryptanalytic attack on a corpus of ciphered messages. This time seven sub-tasks push the bar much higher. Do you accept the challenge? In regression you will test your modeling skills in a series of eight sub-tasks. Try it only if ANOVA is your close friend. The games are part of the Beta and Bit project. You will find more about the Beta and Bit project at <http://betabit.wiki>.

Two partially supervised mixture modeling methods are implemented: soft-label and belief-based modeling. For completeness, the package is also equipped with unsupervised, semi-supervised and fully supervised mixture modeling. The package can also be used to select the best-fitting model from a set of models with different numbers of components or different constraints on their structures. For a detailed introduction see: Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.

Model-agnostic tool for decomposition of predictions from black boxes. The Break Down Table shows contributions of every variable to a final prediction. The Break Down Plot presents variable contributions in a concise graphical way. This package works for binary classifiers and general regression models.
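One way to read such a decomposition is as a sequence of changes in the average prediction while the variables are fixed, one by one, at the values of the observation being explained. The sketch below shows that idea in Python under stated assumptions: the function `break_down`, the toy model and the fixed variable order are hypothetical and are not the package's R implementation.

```python
def mean_prediction(predict, rows):
    return sum(predict(row) for row in rows) / len(rows)


def break_down(predict, data, observation, order):
    """Attribute a prediction to variables by fixing them one at a time."""
    current = [dict(row) for row in data]      # work on a copy of the data
    previous = mean_prediction(predict, current)
    contributions = {"intercept": previous}    # average prediction as baseline
    for feature in order:
        for row in current:
            row[feature] = observation[feature]
        updated = mean_prediction(predict, current)
        contributions[feature] = updated - previous
        previous = updated
    return contributions


def toy_model(row):
    # Hypothetical model used only for this illustration.
    return 2 * row["a"] + row["b"]


data = [{"a": 0, "b": 0}, {"a": 2, "b": 4}]
observation = {"a": 3, "b": 1}
table = break_down(toy_model, data, observation, order=["a", "b"])
```

Once every variable has been fixed, the baseline plus the contributions adds up to the model's prediction for the observation. For models with interactions the individual contributions may depend on the chosen order.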

Ceteris Paribus Profiles (What-If Plots) are designed to present model responses around selected points in a feature space, for example around a single prediction for an interesting observation. The plots are model-agnostic: they work for any predictive Machine Learning model and allow for model comparisons. Ceteris Paribus Plots supplement the Break Down Plots from the 'breakDown' package.
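A profile of this kind can be computed by changing one variable of a chosen observation over a grid of values while all other variables stay fixed. The following Python sketch illustrates the computation only; the function name `ceteris_paribus_profile` and the toy model are assumptions, not the package's interface.

```python
def ceteris_paribus_profile(predict, observation, feature, grid):
    """Model response as one feature varies and all others stay fixed."""
    profile = []
    for value in grid:
        modified = dict(observation)  # copy, so the original stays untouched
        modified[feature] = value
        profile.append((value, predict(modified)))
    return profile


def toy_model(row):
    # Hypothetical model used only for this illustration.
    return 2.0 * row["age"] + 0.5 * row["income"] ** 2


observation = {"age": 30, "income": 4}
profile = ceteris_paribus_profile(toy_model, observation, "income", [0, 2, 4, 6])
# profile == [(0, 60.0), (2, 62.0), (4, 68.0), (6, 78.0)]
```

Since only the `predict` function is called, the same code works for any model, and profiles from two models can be compared on the same grid.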

Machine Learning (ML) models are widely used and have various applications in classification or regression. Models created with boosting, bagging, stacking or similar techniques are often used due to their high performance, but such black-box models usually lack interpretability. The DALEX package contains various explainers that help to understand the link between input variables and model output. The single_variable() explainer extracts the conditional response of a model as a function of a single selected variable. It is a wrapper over packages 'pdp' (Greenwell 2017) <doi:10.32614/RJ-2017-016>, 'ALEPlot' (Apley 2018) <arXiv:1612.08468> and 'factorMerger' (Sitko and Biecek 2017) <arXiv:1709.04412>. The single_prediction() explainer attributes parts of a model prediction to particular variables used in the model. It is a wrapper over the 'breakDown' package (Staniak and Biecek 2018) <doi:10.32614/RJ-2018-072>. The variable_dropout() explainer calculates variable importance scores based on variable shuffling (Fisher et al. 2018) <arXiv:1801.01489>. All these explainers can be plotted with the generic plot() function and compared across different models. 'DALEX' is a part of the 'DrWhy.AI' universe (Biecek 2018) <arXiv:1806.08915>.
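Importance scores based on variable shuffling can be sketched as the increase in a loss after the values of one variable are permuted. The Python example below is an illustration under assumptions, not the DALEX API: the function `permutation_importance`, the mean squared error loss and the toy data are hypothetical.

```python
import random


def mse(predict, rows, targets):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, feature, rng):
    """Increase in loss after shuffling the values of one feature."""
    baseline = mse(predict, rows, targets)
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
    return mse(predict, permuted, targets) - baseline


rng = random.Random(0)
rows = [{"x1": float(i), "x2": rng.random()} for i in range(50)]
targets = [3.0 * row["x1"] for row in rows]


def toy_model(row):
    # Hypothetical model that uses x1 and ignores x2.
    return 3.0 * row["x1"]


importance_x1 = permutation_importance(toy_model, rows, targets, "x1", rng)
importance_x2 = permutation_importance(toy_model, rows, targets, "x2", rng)
```

In this toy setting shuffling `x2` leaves the loss unchanged because the model never reads it, while shuffling `x1` increases the loss. In practice the shuffle is usually repeated several times and the scores are averaged.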

Machine Learning models are widely used and have various applications in classification or regression tasks. Due to increasing computational power, availability of new data sources and new methods, ML models are becoming more and more complex. Models created with techniques like boosting, bagging or neural networks are true black boxes. It is hard to trace the link between input variables and model outcomes. They are used because of their high performance, but lack of interpretability is one of their weakest sides. In many applications we need to know, understand or prove how input variables are used in the model and what impact they have on the final model prediction. DALEX2 is a collection of tools that help to understand how complex predictive models work. DALEX2 is a part of the DrWhy universe of tools for Explanation, Exploration and Visualisation of Predictive Models.

Smooth tests of goodness of fit. These tests are data-driven (the alternative hypothesis is dynamically selected based on the data). In this package you will find various tests for the exponential, Gaussian, Gumbel and uniform distributions.

Concept drift refers to the change in the data distribution or in the relationships between variables over time. 'drifter' calculates distances between variable distributions or variable relations and identifies both types of drift. Key functions are: calculate_covariate_drift() checks distance between corresponding variables in two datasets, calculate_residuals_drift() checks distance between residual distributions for two models, calculate_model_drift() checks distance between partial dependency profiles for two models, check_drift() executes all checks against drift. 'drifter' is a part of the 'DrWhy.AI' universe (Biecek 2018) <arXiv:1806.08915>.
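A distance between the distributions of one variable in two datasets can be computed in many ways. The Python sketch below uses the total variation distance between histograms on shared bins as one possible choice; the distance, the binning and the function names are assumptions for illustration and are not taken from the 'drifter' implementation.

```python
def bin_shares(values, lo, hi, n_bins):
    """Share of values falling into each of n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value is placed in the last bin.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return [count / len(values) for count in counts]


def distribution_distance(old, new, n_bins=4):
    """Total variation distance between two samples on shared bins."""
    lo, hi = min(old + new), max(old + new)
    if lo == hi:
        return 0.0  # all values identical, so no drift can be measured
    p = bin_shares(old, lo, hi, n_bins)
    q = bin_shares(new, lo, hi, n_bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


no_drift = distribution_distance([1, 2, 3, 4], [1, 2, 3, 4])
full_drift = distribution_distance([0, 0, 1, 1], [9, 9, 10, 10])
```

The distance ranges from 0 for identical binned distributions to 1 for samples that share no bin, so a threshold on it can serve as a simple drift check.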

Model-agnostic tool for decomposition of predictions from black boxes. Supports additive attributions and attributions with interactions. The Break Down Table shows contributions of every variable to a final prediction. The Break Down Plot presents variable contributions in a concise graphical way. This package works for classification and regression models. It is an extension of the 'breakDown' package (Staniak and Biecek 2018) <doi:10.32614/RJ-2018-072>, with new and faster strategies for orderings. It supports interactions in explanations and has interactive visuals (implemented with the 'D3.js' library). The underlying methodology is described in the 'iBreakDown' article (Gosiewska and Biecek 2019) <arXiv:1903.11420>. This package is a part of the 'DrWhy.AI' universe (Biecek 2018) <arXiv:1806.08915>.

Collection of tools for assessment of feature importance and feature effects. Key functions are: feature_importance() for assessment of global level feature importance, ceteris_paribus() for calculation of the what-if plots, partial_dependency() for partial dependency plots, conditional_dependency() for conditional dependency plots, accumulated_dependency() for accumulated local effects plots, aggregate_profiles() and cluster_profiles() for aggregation of ceteris paribus profiles, generic print() and plot() for better usability of selected explainers, generic plotD3() for interactive, D3 based explanations, and generic describe() for explanations in natural language. The package 'ingredients' is a part of the 'DrWhy.AI' universe (Biecek 2018) <arXiv:1806.08915>.

'DrWhy.AI' is a collection of tools for Explainable AI (XAI). It is based on shared principles and a simple grammar for exploration, explanation and visualisation of predictive models. This package is designed to make it easy to install and load multiple 'DALEXverse' packages in a single step. It is heavily inspired by the 'tidyverse'.

A set of datasets and functions used in the book 'Modele liniowe i mieszane w R, wraz z przykladami w analizie danych'. Datasets either come from real studies or are created to be as similar as possible to real studies.

The data sets used in the online course "PogromcyDanych". You can process data in many ways. The course Data Crunchers will introduce you to this variety. For this reason we will work on datasets of different sizes (from several to several hundred thousand rows), with various levels of complexity (from two to two thousand columns) and prepared in different formats (text data, quantitative data and qualitative data). All of these data sets were gathered in a single big package called PogromcyDanych to facilitate access to them. It contains all sorts of data sets, such as data about offer prices of cars, results of opinion polls, information about changes in stock market indices, data about names given to newborn babies, ski jumping results and information about treatment outcomes of breast cancer patients.

'The Proton Game' is a console-based data-crunching game for younger and older data scientists. Act as a data-hacker and find Slawomir Pietraszko's credentials to the Proton server. You have to solve four data-based puzzles to find the login and password. There are many ways to solve these puzzles. You may use loops, data filtering, ordering, aggregation or other tools. Only basic knowledge of R is required to play the game, yet the more functions you know, the more approaches you can try. Knowledge of dplyr is not required but may be very helpful. This game is linked with the "Pietraszko's Cave" story available at http://biecek.pl/BetaBit/Warsaw. It is a part of the Beta and Bit series. You will find more about the Beta and Bit series at http://biecek.pl/BetaBit.

Data sets and functions used in the Polish book "Przewodnik po pakiecie R" (The Hitchhiker's Guide to the R). See more at <http://biecek.pl/R>. Among others, you will find here data about housing prices, cancer patients and running times.

Tools for accessing and processing datasets prepared by the Foundation SmarterPoland.pl. These include access to the APIs of Google Maps, the Central Statistical Office of Poland, MojePanstwo, Eurostat, WHO and other sources.

An extension of the 'archivist' package that integrates it with GitHub via the GitHub API and the 'git2r' and 'httr' packages.

Provides an easy-to-use unified interface for creating validation plots for any model. The 'auditor' helps to avoid the repetitive work of writing the code needed to create residual plots. These visualizations allow you to assess and compare the goodness of fit, performance, and similarity of models.

Estimates coefficients of the Cox proportional hazards model using a stochastic gradient descent algorithm for batch data.

Tool for analyzing competing risks models. The main point of interest is testing differences between groups (as described in R.J Gray (1988) <doi:10.1214/aos/1176350951> and J.P. Fine, R.J Gray (1999) <doi:10.2307/2670170>) and visualizations of survival and cumulative incidence curves.

Structure mining from 'XGBoost' and 'LightGBM' models. Key functionalities of this package cover: visualisation of tree-based ensemble models, identification of interactions, measuring of variable importance, measuring of interaction importance, explanation of a single prediction with break down plots (based on the 'xgboostExplainer' and 'breakDown' packages). To download 'LightGBM', use the following link: <https://github.com/Microsoft/LightGBM>. 'EIX' is a part of the 'DrWhy.AI' universe.

Tools to download data from the Eurostat database <http://ec.europa.eu/eurostat> together with search and manipulation utilities.

The Merging Path Plot is a methodology for adaptive fusing of k-groups with likelihood-based model selection. This package contains tools for exploration and visualization of k-group dissimilarities. Comparison of k-groups is one of the most important issues in exploratory analyses and it has countless applications. The traditional approach is to use pairwise post hoc tests in order to verify which groups differ significantly. However, with a large number of groups this approach fails in both the interpretation and the visualization layers. The Merging Path Plot solves this problem by using an easy-to-understand description of dissimilarity among groups based on the Likelihood Ratio Test (LRT) statistic (Sitko, Biecek 2017) <arXiv:1709.04412>. 'factorMerger' is a part of the 'DrWhy.AI' universe (Biecek 2018) <arXiv:1806.08915>. Work on this package was financially supported by the 'NCN Opus grant 2016/21/B/ST6/02176'.

Provides tools for importing, merging, and analysing data from international assessment studies (TIMSS, PIRLS, PISA, ICILS, and PIAAC).

Interpretability of complex machine learning models is a growing concern. This package helps to understand the key factors that drive the decisions made by a complicated predictive model (a so-called black box model). This is achieved through local approximations that are based either on an additive regression-like model or on a CART-like model that allows for higher-order interactions. The methodology is based on Tulio Ribeiro, Singh, Guestrin (2016) <doi:10.1145/2939672.2939778>. More details can be found in Staniak, Biecek (2018) <arXiv:1804.01955>.

Local explanations of machine learning models describe how features contributed to a single prediction. This package implements an explanation method based on LIME (Local Interpretable Model-agnostic Explanations, see Tulio Ribeiro, Singh, Guestrin (2016) <doi:10.1145/2939672.2939778>) in which interpretable inputs are created based on the local rather than the global behaviour of each original feature.

Website generator with HTML summaries for predictive models. This package uses 'DALEX' explainers to describe global model behavior. We can see how well models behave (tabs: Model Performance, Auditor), how much each variable contributes to predictions (tabs: Variable Response) and which variables are the most important for a given model (tabs: Variable Importance). We can also compare Concept Drift for pairs of models (tabs: Drifter). Additionally, data available on the website can be easily recreated in the current R session. Work on this package was financially supported by the NCN Opus grant 2017/27/B/ST6/01307 at Warsaw University of Technology, Faculty of Mathematics and Information Science.

Automates the explanation of machine learning predictive models. This package generates advanced interactive and animated model explanations in the form of a serverless HTML site. It combines 'R' with 'D3.js' to produce plots and descriptions for local and global explanations. The whole is greater than the sum of its parts, so it also supports EDA (Exploratory Data Analysis) on top of that. 'modelStudio' is a fast and condensed way to get all the answers without much effort. Break down your model and look into its ingredients with only a few lines of code.

A set of tools to help explain which variables are most important in a random forest. Various variable importance measures are calculated and visualized in different settings in order to get an idea of how their importance changes depending on our criteria (Hemant Ishwaran and Udaya B. Kogalur and Eiran Z. Gorodeski and Andy J. Minn and Michael S. Lauer (2010) <doi:10.1198/jasa.2009.tm08622>, Leo Breiman (2001) <doi:10.1023/A:1010933404324>).

Set of functions that access information about deputies and votes in the Polish Diet (Sejm) from the webpage <http://www.sejm.gov.pl/>. The package was developed as a result of an internship in MI2 Group - <http://mi2.mini.pw.edu.pl/>, Faculty of Mathematics and Information Science, Warsaw University of Technology.

Provides SHAP explanations of machine learning models. In applied machine learning, there is a strong belief that we need to strike a balance between interpretability and accuracy. However, in the field of Interpretable Machine Learning, there are more and more new ideas for explaining black-box models. One of the best known methods for local explanations is SHapley Additive exPlanations (SHAP) introduced by Lundberg, S., et al., (2016) <arXiv:1705.07874>. The SHAP method is used to calculate the influence of variables on a particular observation. This method is based on Shapley values, a technique used in game theory. The R package 'shapper' is a port of the Python library 'shap'.
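The game-theoretic notion behind the method can be shown on a small cooperative game: a player's Shapley value is its marginal contribution averaged over all orders in which players may join. The Python sketch below computes this exactly by enumeration. It illustrates Shapley values only and is not the algorithm used by 'shap' or 'shapper'; the function `shapley_values` and the toy game are assumptions.

```python
from itertools import permutations


def shapley_values(value, players):
    """Exact Shapley values by averaging over all player orderings."""
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for player in ordering:
            extended = coalition | {player}
            # Marginal contribution of the player joining the coalition.
            totals[player] += value(extended) - value(coalition)
            coalition = extended
    return {player: total / len(orderings) for player, total in totals.items()}


def toy_game(coalition):
    # Hypothetical payoff: "a" is worth 10 alone; "b" and "c" add 4 only together.
    payoff = 10 if "a" in coalition else 0
    if "b" in coalition and "c" in coalition:
        payoff += 4
    return payoff


values = shapley_values(toy_game, ["a", "b", "c"])
# values == {"a": 10.0, "b": 2.0, "c": 2.0}
```

When explaining a prediction, the players are the variables and the payoff of a coalition is typically the expected model output when only those variables are known. Full enumeration grows factorially with the number of players, so practical implementations rely on approximations.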

Contains the function 'ggsurvplot()' for easily drawing beautiful and 'ready-to-publish' survival curves with the 'number at risk' table and 'censoring count plot'. Other functions are also available to plot adjusted curves for the 'Cox' model and to visually examine 'Cox' model assumptions.

Survival models may have very different structures. This package contains functions for creating a unified representation of survival models, which can be further processed by various survival explainers. Tools implemented in 'survxai' help to understand how input variables are used in the model and what impact they have on the final model prediction. Currently, four explanation methods are implemented. They can be divided into two groups: local and global.

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes. The key is to understand genomics to improve cancer care. The RTCGA package supports downloading and integrating the variety and volume of TCGA data using the patient barcode as a key, which makes the data easier to obtain. This may have a beneficial influence on the development of science and the improvement of patients' treatment. Furthermore, the RTCGA package transforms TCGA data into a tidy form which is convenient to use.

Provides an easy-to-calculate variable importance measure based on the Ceteris Paribus plot. The measure is calculated in eight variants, which arise from the possible combinations of three parameters: absolute_deviation, point and density.

Builds a generalized linear model with automatic data transformation. The 'xspliner' helps to build simple, interpretable models that inherit information provided by more complicated ones. The resulting model may be treated as an explanation of the black box that was supplied to the algorithm.