Pipe an object forward into a function call/expression.

The %>% operator pipes the left-hand side into an expression on the right-hand side. The expression can contain a . as placeholder to indicate the position taken by the object in the pipeline. If not present, it will be squeezed in as the first argument. If the right-hand side expression is a function call that takes only one argument, one can omit parentheses and the .. Only the outmost call is matched against the dot, which means that e.g. formulas can still use a dot which will not be matched. Nested functions will not be matched either.

lhs %>% rhs
That which is to be piped
the pipe to be stuffed with tobacco.
  • %>%

Batting %>%
  group_by(playerID) %>%
  summarise(total = sum(G)) %>%
  arrange(desc(total)) %>%

iris %>%
  filter(Petal.Length > 5) %>%
  select(-Species) %>%

iris %>%
  aggregate(. ~ Species, ., mean)

rnorm(1000) %>% abs %>% sum
Documentation reproduced from package magrittr, version 1.0.0, License: MIT + file LICENSE

Community examples

luciabertova.lb@gmail.com at Mar 6, 2019 magrittr v1.5

--- title: "Exploratory Analysis - Instacart" author: "Philipp Spachtholz" output: html_document: fig_height: 4 fig_width: 7 theme: cosmo --- ### Welcome and good luck to you all at Instacart Market Basket Competition! Here is a first exploratory analysis of the competition dataset. On its website Instacart has a recommendation feature, suggesting the users some items that he/she may buy again. Our task is to predict which items will be reordered on the next order. The dataset consists of information about 3.4 million grocery orders, distributed across 6 csv files. ### Read in the data ```{r message=FALSE, warning=FALSE, results='hide'} library(data.table) library(dplyr) library(ggplot2) library(knitr) library(stringr) library(DT) orders <- fread('../input/orders.csv') products <- fread('../input/products.csv') order_products <- fread('../input/order_products__train.csv') order_products_prior <- fread('../input/order_products__prior.csv') aisles <- fread('../input/aisles.csv') departments <- fread('../input/departments.csv') ``` ```{r include=FALSE} options(tibble.width = Inf) ``` Lets first have a look at these files: ### Peek at the dataset {.tabset} #### orders This file gives a list of all orders we have in the dataset. 1 row per order. For example, we can see that user 1 has 11 orders, 1 of which is in the train set, and 10 of which are prior orders. The orders.csv doesn't tell us about which products were ordered. This is contained in the order_products.csv ```{r, result='asis'} kable(head(orders,12)) glimpse(orders) ``` #### order_products_train This file gives us information about which products (product_id) were ordered. It also contains information of the order (add_to_cart_order) in which the products were put into the cart and information of whether this product is a re-order(1) or not(0). For example, we see below that order_id 1 had 8 products, 4 of which are reorders. Still we don't know what these products are. This information is in the products.csv ```{r} kable(head(order_products,10)) glimpse(order_products) ``` #### products This file contains the names of the products with their corresponding product_id. Furthermore the aisle and deparment are included. ```{r} kable(head(products,10)) glimpse(products) ``` #### order_products_prior This file is structurally the same as the other_products_train.csv. ```{r, result='asis'} kable(head(order_products_prior,10)) glimpse(order_products_prior) ``` #### aisles This file contains the different aisles. ```{r, result='asis'} kable(head(aisles,10)) glimpse(aisles) ``` #### departments ```{r, result='asis'} kable(head(departments,10)) glimpse(departments) ``` ### Recode variables We should do some recoding and convert character variables to factors. ```{r message=FALSE, warning=FALSE} orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set)) products <- products %>% mutate(product_name = as.factor(product_name)) aisles <- aisles %>% mutate(aisle = as.factor(aisle)) departments <- departments %>% mutate(department = as.factor(department)) ``` ### When do people order? Let's have a look when people buy groceries online. #### Hour of Day There is a clear effect of hour of day on order volume. Most orders are between 8.00-18.00 ```{r warning=FALSE} orders %>% ggplot(aes(x=order_hour_of_day)) + geom_histogram(stat="count",fill="red") ``` #### Day of Week There is a clear effect of day of the week. Most orders are on days 0 and 1. Unfortunately there is no info regarding which values represent which day, but one would assume that this is the weekend. ```{r warning=FALSE} orders %>% ggplot(aes(x=order_dow)) + geom_histogram(stat="count",fill="red") ``` ### When do they order again? People seem to order more often after exactly 1 week. ```{r warning=FALSE} orders %>% ggplot(aes(x=days_since_prior_order)) + geom_histogram(stat="count",fill="red") ``` ### How many prior orders are there? We can see that there are always at least 3 prior orders. ```{r} orders %>% filter(eval_set=="prior") %>% count(order_number) %>% ggplot(aes(order_number,n)) + geom_line(color="red", size=1)+geom_point(size=2, color="red") ``` ### How many items do people buy? {.tabset} Let's have a look how many items are in the orders. We can see that people most often order around 5 items. The distributions are comparable between the train and prior order set. #### Train set ```{r warning=FALSE} order_products %>% group_by(order_id) %>% summarize(n_items = last(add_to_cart_order)) %>% ggplot(aes(x=n_items))+ geom_histogram(stat="count",fill="red") + geom_rug()+ coord_cartesian(xlim=c(0,80)) ``` #### Prior orders set ```{r warning=FALSE} order_products_prior %>% group_by(order_id) %>% summarize(n_items = last(add_to_cart_order)) %>% ggplot(aes(x=n_items))+ geom_histogram(stat="count",fill="red") + geom_rug() + coord_cartesian(xlim=c(0,80)) ``` ### Bestsellers Let's have a look which products are sold most often (top10). And the clear winner is: **Bananas** ```{r fig.height=5.5} tmp <- order_products %>% group_by(product_id) %>% summarize(count = n()) %>% top_n(10, wt = count) %>% left_join(select(products,product_id,product_name),by="product_id") %>% arrange(desc(count)) kable(tmp) tmp %>% ggplot(aes(x=reorder(product_name,-count), y=count))+ geom_bar(stat="identity",fill="red")+ theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank()) ``` ### How often do people order the same items again? 59% of the ordered items are reorders. ```{r warning=FALSE, fig.width=4} tmp <- order_products %>% group_by(reordered) %>% summarize(count = n()) %>% mutate(reordered = as.factor(reordered)) %>% mutate(proportion = count/sum(count)) kable(tmp) tmp %>% ggplot(aes(x=reordered,y=count,fill=reordered))+ geom_bar(stat="identity") ``` ### Most often reordered Now here it becomes really interesting. These 10 products have the highest probability of being reordered. ```{r warning=FALSE, fig.height=5.5} tmp <-order_products %>% group_by(product_id) %>% summarize(proportion_reordered = mean(reordered), n=n()) %>% filter(n>40) %>% top_n(10,wt=proportion_reordered) %>% arrange(desc(proportion_reordered)) %>% left_join(products,by="product_id") kable(tmp) tmp %>% ggplot(aes(x=reorder(product_name,-proportion_reordered), y=proportion_reordered))+ geom_bar(stat="identity",fill="red")+ theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.85,0.95)) ``` ### Which item do people put into the cart first? People seem to be quite certain about Multifold Towels and if they buy them, put them into their cart first in 66% of the time. ```{r message=FALSE, fig.height=5.5} tmp <- order_products %>% group_by(product_id, add_to_cart_order) %>% summarize(count = n()) %>% mutate(pct=count/sum(count)) %>% filter(add_to_cart_order == 1, count>10) %>% arrange(desc(pct)) %>% left_join(products,by="product_id") %>% select(product_name, pct, count) %>% ungroup() %>% top_n(10, wt=pct) kable(tmp) tmp %>% ggplot(aes(x=reorder(product_name,-pct), y=pct))+ geom_bar(stat="identity",fill="red")+ theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.4,0.7)) ``` ### Association between time of last order and probability of reorder This is interesting: We can see that if people order again on the same day, they order the same product more often. Whereas when 30 days have passed, they tend to try out new things in their order. ```{r} order_products %>% left_join(orders,by="order_id") %>% group_by(days_since_prior_order) %>% summarize(mean_reorder = mean(reordered)) %>% ggplot(aes(x=days_since_prior_order,y=mean_reorder))+ geom_bar(stat="identity",fill="red") ``` ### Association between number of orders and probability of reordering Products with a high number of orders are naturally more likely to be reordered. However, there seems to be a ceiling effect. ```{r message=FALSE} order_products %>% group_by(product_id) %>% summarize(proportion_reordered = mean(reordered), n=n()) %>% ggplot(aes(x=n,y=proportion_reordered))+ geom_point()+ geom_smooth(color="red")+ coord_cartesian(xlim=c(0,2000)) ``` ### Organic vs Non-organic What is the percentage of orders that are organic vs. not organic? ```{r fig.width=4} products <- products %>% mutate(organic=ifelse(str_detect(str_to_lower(products$product_name),'organic'),"organic","not organic"), organic= as.factor(organic)) tmp <- order_products %>% left_join(products, by="product_id") %>% group_by(organic) %>% summarize(count = n()) %>% mutate(proportion = count/sum(count)) kable(tmp) tmp %>% ggplot(aes(x=organic,y=count, fill=organic))+ geom_bar(stat="identity") ``` ### Reordering Organic vs Non-Organic People more often reorder organic products vs non-organic products. ```{r fig.width=4} tmp <- order_products %>% left_join(products,by="product_id") %>% group_by(organic) %>% summarize(mean_reordered = mean(reordered)) kable(tmp) tmp %>% ggplot(aes(x=organic,fill=organic,y=mean_reordered))+geom_bar(stat="identity") ``` ### Visualizing the Product Portfolio Here is use to treemap package to visualize the structure of instacarts product portfolio. In total there are 21 departments containing 134 aisles. ```{r} library(treemap) tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n()) tmp <- tmp %>% left_join(departments,by="department_id") tmp <- tmp %>% left_join(aisles,by="aisle_id") tmp2<-order_products %>% group_by(product_id) %>% summarize(count=n()) %>% left_join(products,by="product_id") %>% ungroup() %>% group_by(department_id,aisle_id) %>% summarize(sumcount = sum(count)) %>% left_join(tmp, by = c("department_id", "aisle_id")) %>% mutate(onesize = 1) ``` #### How are aisles organized within departments? ```{r, fig.width=9, fig.height=6} treemap(tmp2,index=c("department","aisle"),vSize="onesize",vColor="department",palette="Set3",title="",sortID="-sumcount", border.col="#FFFFFF",type="categorical", fontsize.legend = 0,bg.labels = "#FFFFFF") ``` #### How many unique products are offered in each department/aisle? The size of the boxes shows the number of products in each category. ```{r, fig.width=9, fig.height=6} treemap(tmp,index=c("department","aisle"),vSize="n",title="",palette="Set3",border.col="#FFFFFF") ``` #### How often are products from the department/aisle sold? The size of the boxes shows the number of sales. ```{r, fig.width=9, fig.height=6} treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF") ``` ### Exploring Customer Habits Here i look for customers who just reorder the same products again all the time. To search those I look at all orders (excluding the first order), where the percentage of reordered items is exactly 1 (This can easily be adapted to look at more lenient thresholds). We can see there are in fact **3,487** customers, just always reordering products. #### Customers reordering only ```{r} tmp <- order_products_prior %>% group_by(order_id) %>% summarize(m = mean(reordered),n=n()) %>% right_join(filter(orders,order_number>2), by="order_id") tmp2 <- tmp %>% filter(eval_set =="prior") %>% group_by(user_id) %>% summarize(n_equal = sum(m==1,na.rm=T), percent_equal = n_equal/n()) %>% filter(percent_equal == 1) %>% arrange(desc(n_equal)) datatable(tmp2, class="table-condensed", style="bootstrap", options = list(dom = 'tp')) ``` #### The customer with the strongest habit The coolest customer is id #99753, having 97 orders with only reordered items. That's what I call a strong habit. She/he seems to like Organic Milk :-) ```{r warning=FALSE} uniqueorders <- filter(tmp, user_id == 99753)$order_id tmp <- order_products_prior %>% filter(order_id %in% uniqueorders) %>% left_join(products, by="product_id") datatable(select(tmp,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 'tp')) ``` <br> Let's look at his order in the train set. One would assume that he would buy "Organic Whole Milk" and "Organic Reduced Fat Milk": ```{r warning=FALSE} tmp <- orders %>% filter(user_id==99753, eval_set == "train") tmp2 <- order_products %>% filter(order_id == tmp$order_id) %>% left_join(products, by="product_id") datatable(select(tmp2,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 't')) ``` **Tadaaaa. Prediction 100% correct.** <br><br> **Thank you all for the nice comments and upvotes. You are great.**