# %>%

##### magrittr forward-pipe operator

Pipe an object forward into a function or call expression.

##### Usage

`lhs %>% rhs`

##### Arguments

- `lhs`: A value or the magrittr placeholder.
- `rhs`: A function call using the magrittr semantics.

##### Details

**Using %>% with unary function calls**
When functions require only one argument, `x %>% f` is equivalent to `f(x)` (not exactly equivalent; see the technical notes below).

**Placing lhs as the first argument in rhs call**
The default behavior of `%>%` when multiple arguments are required in the `rhs` call is to place `lhs` as the first argument, i.e. `x %>% f(y)` is equivalent to `f(x, y)`.

**Placing lhs elsewhere in rhs call**
Often you will want to pass `lhs` to the `rhs` call at a position other than the first. For this purpose you can use the dot (`.`) as a placeholder. For example, `y %>% f(x, .)` is equivalent to `f(x, y)` and `z %>% f(x, y, arg = .)` is equivalent to `f(x, y, arg = z)`.

**Using the dot for secondary purposes**
Often, some attribute or property of `lhs` is desired in the `rhs` call in addition to the value of `lhs` itself, e.g. the number of rows or columns. It is perfectly valid to use the dot placeholder several times in the `rhs` call, but by design the behavior is slightly different when using it inside nested function calls. In particular, if the placeholder is only used in a nested function call, `lhs` will also be placed as the first argument! The reason for this is that in most use-cases this produces the most readable code. For example, `iris %>% subset(1:nrow(.) %% 2 == 0)` is equivalent to `iris %>% subset(., 1:nrow(.) %% 2 == 0)` but slightly more compact. It is possible to overrule this behavior by enclosing the `rhs` in braces. For example, `1:10 %>% {c(min(.), max(.))}` is equivalent to `c(min(1:10), max(1:10))`.
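The placement rules described so far can be checked side by side. The following is a minimal sketch, assuming magrittr is installed; `f` is a hypothetical two-argument helper introduced only for illustration and is not part of the package.

```r
library(magrittr)

# Hypothetical helper (illustration only): shows where each argument ends up.
f <- function(a, b) paste(a, b, sep = "-")

# Unary call: x %>% f gives the same value as f(x).
r_unary <- 4 %>% sqrt

# First-argument rule: x %>% f(y) gives the same value as f(x, y).
r_first <- "x" %>% f("y")

# Explicit placeholder: y %>% f(x, .) gives the same value as f(x, y).
r_dot <- "y" %>% f("x", .)

# Dot used only inside a nested call: lhs is still inserted as first argument,
# so this behaves like rep(1:3, length(1:3)).
r_nested <- 1:3 %>% rep(length(.))

# Braces switch the first-argument rule off.
r_braces <- 1:10 %>% {c(min(.), max(.))}

# Each piped form matches its plain-call counterpart.
stopifnot(
  identical(r_unary, sqrt(4)),
  identical(r_first, f("x", "y")),
  identical(r_dot, f("x", "y")),
  identical(r_nested, rep(1:3, length(1:3))),
  identical(r_braces, c(min(1:10), max(1:10)))
)
```

The comparisons are on returned values only; as the technical notes point out, the piped and plain calls are not exactly equivalent in how they are evaluated.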

**Using %>% with call- or function-producing rhs**
It is possible to force evaluation of `rhs` before the piping of `lhs` takes place. This is useful when `rhs` produces the relevant call or function. To evaluate `rhs` first, enclose it in parentheses, i.e. `a %>% (function(x) x^2)`, and `1:10 %>% (call("sum"))`. Another example where this is relevant is for reference class methods which are accessed using the `$` operator, where one would do `x %>% (rc$f)`, and not `x %>% rc$f`.

**Using lambda expressions with %>%**
Each `rhs` is essentially a one-expression body of a unary function. Therefore defining lambdas in magrittr is very natural and works like defining regular functions: if more than a single expression is needed, one encloses the body in a pair of braces, `{ rhs }`. However, note that within braces there is no "first-argument rule": it will be exactly like writing a unary function where the argument name is "`.`" (the dot).

**Using the dot placeholder as lhs**
When the dot is used as `lhs`, the result will be a functional sequence, i.e. a function which applies the entire chain of right-hand sides in turn to its input. See the examples.
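The last three cases (parenthesized `rhs`, brace lambdas, and a dot as `lhs`) can be sketched as follows. This is an illustrative example, not taken from the package documentation: `ops` is a hypothetical list standing in for an object whose functions are reached with `$`, and `clean` is a made-up name for the resulting functional sequence.

```r
library(magrittr)

# Parentheses: rhs is evaluated first and the resulting function is applied.
square <- 3 %>% (function(x) x^2)

# Hypothetical container (illustration only) whose function is reached via `$`.
ops <- list(double = function(x) 2 * x)
doubled <- 5 %>% (ops$double)

# Brace lambda: several expressions, with `.` as the only argument name.
range_width <- c(4, 9, 1) %>% {
  lo <- min(.)
  hi <- max(.)
  hi - lo
}

# Dot as lhs: builds a function that applies trimws(), then toupper().
clean <- . %>% trimws %>% toupper

stopifnot(
  square == 9,
  doubled == 10,
  range_width == 8,
  is.function(clean),
  identical(clean("  pipe "), "PIPE")
)
```

Because `clean` is an ordinary function, it can be called directly or used as the `rhs` of another pipe.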
##### Technical notes

The magrittr pipe operators use non-standard evaluation. They capture their inputs and examine them to figure out how to proceed. First a function is produced from all of the individual right-hand side expressions, and then the result is obtained by applying this function to the left-hand side. For most purposes, one can disregard the subtle aspects of magrittr's evaluation, but some functions may capture their calling environment, and thus using the operators will not be exactly equivalent to the "standard call" without pipe-operators.

Another note is that special attention is advised when using non-magrittr operators in a pipe-chain (`+`, `-`, `$`, etc.), as operator precedence will impact how the chain is evaluated. In general it is advised to use the aliases provided by magrittr.
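To make the precedence point concrete, here is a small sketch. It relies on R's documented operator precedence, in which `%any%` operators bind tighter than binary `+`, and on the magrittr aliases `add()` and `extract2()`. The use of the built-in `mtcars` data set is just an illustrative choice.

```r
library(magrittr)

# %>% binds tighter than binary `+`, so only 2 is piped into sqrt():
# this is parsed as 1 + (2 %>% sqrt).
a <- 1 + 2 %>% sqrt

# Parenthesizing the left-hand side pipes the sum instead.
b <- (1 + 2) %>% sqrt

# The alias add() keeps the chain reading left to right.
c_alias <- 1 %>% add(2) %>% sqrt

# extract2() is the alias for `[[`, avoiding `$` or `[[` inside the chain.
avg_mpg <- mtcars %>% extract2("mpg") %>% mean

stopifnot(
  isTRUE(all.equal(a, 1 + sqrt(2))),
  isTRUE(all.equal(b, sqrt(3))),
  isTRUE(all.equal(c_alias, sqrt(3))),
  isTRUE(all.equal(avg_mpg, mean(mtcars[["mpg"]])))
)
```

The first and second results differ, which is the kind of surprise the aliases are meant to avoid.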

##### Examples

```
library(magrittr)
# Basic use:
iris %>% head
# Use with lhs as first argument
iris %>% head(10)
# Using the dot place-holder
"Ceci n'est pas une pipe" %>% gsub("une", "un", .)
# When dot is nested, lhs is still placed first:
sample(1:10) %>% paste0(LETTERS[.])
# This can be avoided:
rnorm(100) %>% {c(min(.), mean(.), max(.))} %>% floor
# Lambda expressions:
iris %>%
  {
    size <- sample(1:10, size = 1)
    rbind(head(., size), tail(., size))
  }
# renaming in lambdas:
iris %>%
  {
    my_data <- .
    size <- sample(1:10, size = 1)
    rbind(head(my_data, size), tail(my_data, size))
  }
# Building unary functions with %>%
trig_fest <- . %>% tan %>% cos %>% sin
1:10 %>% trig_fest
trig_fest(1:10)
```

*Documentation reproduced from package magrittr, version 1.5, License: MIT + file LICENSE*

### Community examples

**luciabertova.lb@gmail.com** at Mar 6, 2019, magrittr v1.5

---
title: "Exploratory Analysis - Instacart"
author: "Philipp Spachtholz"
output:
  html_document:
    fig_height: 4
    fig_width: 7
    theme: cosmo
---

### Welcome and good luck to you all at Instacart Market Basket Competition!

Here is a first exploratory analysis of the competition dataset. On its website Instacart has a recommendation feature, suggesting to users some items that they may buy again. Our task is to predict which items will be reordered on the next order.

The dataset consists of information about 3.4 million grocery orders, distributed across 6 csv files.

### Read in the data

```{r message=FALSE, warning=FALSE, results='hide'}
library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(stringr)
library(DT)

orders <- fread('../input/orders.csv')
products <- fread('../input/products.csv')
order_products <- fread('../input/order_products__train.csv')
order_products_prior <- fread('../input/order_products__prior.csv')
aisles <- fread('../input/aisles.csv')
departments <- fread('../input/departments.csv')
```

```{r include=FALSE}
options(tibble.width = Inf)
```

Let's first have a look at these files:

### Peek at the dataset {.tabset}

#### orders

This file gives a list of all orders we have in the dataset, 1 row per order. For example, we can see that user 1 has 11 orders, 1 of which is in the train set, and 10 of which are prior orders. The orders.csv doesn't tell us which products were ordered. This is contained in the order_products.csv

```{r, result='asis'}
kable(head(orders,12))
glimpse(orders)
```

#### order_products_train

This file gives us information about which products (product_id) were ordered. It also contains information on the order (add_to_cart_order) in which the products were put into the cart and on whether this product is a re-order (1) or not (0).

For example, we see below that order_id 1 had 8 products, 4 of which are reorders.

Still we don't know what these products are.
This information is in the products.csv

```{r}
kable(head(order_products,10))
glimpse(order_products)
```

#### products

This file contains the names of the products with their corresponding product_id. Furthermore the aisle and department are included.

```{r}
kable(head(products,10))
glimpse(products)
```

#### order_products_prior

This file is structurally the same as the order_products_train.csv.

```{r, result='asis'}
kable(head(order_products_prior,10))
glimpse(order_products_prior)
```

#### aisles

This file contains the different aisles.

```{r, result='asis'}
kable(head(aisles,10))
glimpse(aisles)
```

#### departments

```{r, result='asis'}
kable(head(departments,10))
glimpse(departments)
```

### Recode variables

We should do some recoding and convert character variables to factors.

```{r message=FALSE, warning=FALSE}
orders <- orders %>%
  mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))
```

### When do people order?

Let's have a look at when people buy groceries online.

#### Hour of Day

There is a clear effect of hour of day on order volume. Most orders are between 8.00-18.00

```{r warning=FALSE}
orders %>%
  ggplot(aes(x=order_hour_of_day)) +
  geom_histogram(stat="count",fill="red")
```

#### Day of Week

There is a clear effect of day of the week. Most orders are on days 0 and 1. Unfortunately there is no info regarding which values represent which day, but one would assume that this is the weekend.

```{r warning=FALSE}
orders %>%
  ggplot(aes(x=order_dow)) +
  geom_histogram(stat="count",fill="red")
```

### When do they order again?

People seem to order more often after exactly 1 week.

```{r warning=FALSE}
orders %>%
  ggplot(aes(x=days_since_prior_order)) +
  geom_histogram(stat="count",fill="red")
```

### How many prior orders are there?
We can see that there are always at least 3 prior orders.

```{r}
orders %>%
  filter(eval_set=="prior") %>%
  count(order_number) %>%
  ggplot(aes(order_number,n)) +
  geom_line(color="red", size=1)+geom_point(size=2, color="red")
```

### How many items do people buy? {.tabset}

Let's have a look how many items are in the orders. We can see that people most often order around 5 items. The distributions are comparable between the train and prior order set.

#### Train set

```{r warning=FALSE}
order_products %>%
  group_by(order_id) %>%
  summarize(n_items = last(add_to_cart_order)) %>%
  ggplot(aes(x=n_items))+
  geom_histogram(stat="count",fill="red") +
  geom_rug()+
  coord_cartesian(xlim=c(0,80))
```

#### Prior orders set

```{r warning=FALSE}
order_products_prior %>%
  group_by(order_id) %>%
  summarize(n_items = last(add_to_cart_order)) %>%
  ggplot(aes(x=n_items))+
  geom_histogram(stat="count",fill="red") +
  geom_rug() +
  coord_cartesian(xlim=c(0,80))
```

### Bestsellers

Let's have a look which products are sold most often (top10). And the clear winner is: **Bananas**

```{r fig.height=5.5}
tmp <- order_products %>%
  group_by(product_id) %>%
  summarize(count = n()) %>%
  top_n(10, wt = count) %>%
  left_join(select(products,product_id,product_name),by="product_id") %>%
  arrange(desc(count))
kable(tmp)

tmp %>%
  ggplot(aes(x=reorder(product_name,-count), y=count))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())
```

### How often do people order the same items again?

59% of the ordered items are reorders.

```{r warning=FALSE, fig.width=4}
tmp <- order_products %>%
  group_by(reordered) %>%
  summarize(count = n()) %>%
  mutate(reordered = as.factor(reordered)) %>%
  mutate(proportion = count/sum(count))
kable(tmp)

tmp %>%
  ggplot(aes(x=reordered,y=count,fill=reordered))+
  geom_bar(stat="identity")
```

### Most often reordered

Now here it becomes really interesting. These 10 products have the highest probability of being reordered.
```{r warning=FALSE, fig.height=5.5}
tmp <-order_products %>%
  group_by(product_id) %>%
  summarize(proportion_reordered = mean(reordered), n=n()) %>%
  filter(n>40) %>%
  top_n(10,wt=proportion_reordered) %>%
  arrange(desc(proportion_reordered)) %>%
  left_join(products,by="product_id")

kable(tmp)

tmp %>%
  ggplot(aes(x=reorder(product_name,-proportion_reordered), y=proportion_reordered))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.85,0.95))
```

### Which item do people put into the cart first?

People seem to be quite certain about Multifold Towels and if they buy them, put them into their cart first in 66% of the time.

```{r message=FALSE, fig.height=5.5}
tmp <- order_products %>%
  group_by(product_id, add_to_cart_order) %>%
  summarize(count = n()) %>%
  mutate(pct=count/sum(count)) %>%
  filter(add_to_cart_order == 1, count>10) %>%
  arrange(desc(pct)) %>%
  left_join(products,by="product_id") %>%
  select(product_name, pct, count) %>%
  ungroup() %>%
  top_n(10, wt=pct)

kable(tmp)

tmp %>%
  ggplot(aes(x=reorder(product_name,-pct), y=pct))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.4,0.7))
```

### Association between time of last order and probability of reorder

This is interesting: We can see that if people order again on the same day, they order the same product more often. Whereas when 30 days have passed, they tend to try out new things in their order.

```{r}
order_products %>%
  left_join(orders,by="order_id") %>%
  group_by(days_since_prior_order) %>%
  summarize(mean_reorder = mean(reordered)) %>%
  ggplot(aes(x=days_since_prior_order,y=mean_reorder))+
  geom_bar(stat="identity",fill="red")
```

### Association between number of orders and probability of reordering

Products with a high number of orders are naturally more likely to be reordered. However, there seems to be a ceiling effect.
```{r message=FALSE}
order_products %>%
  group_by(product_id) %>%
  summarize(proportion_reordered = mean(reordered), n=n()) %>%
  ggplot(aes(x=n,y=proportion_reordered))+
  geom_point()+
  geom_smooth(color="red")+
  coord_cartesian(xlim=c(0,2000))
```

### Organic vs Non-organic

What is the percentage of orders that are organic vs. not organic?

```{r fig.width=4}
products <- products %>%
  mutate(organic=ifelse(str_detect(str_to_lower(products$product_name),'organic'),"organic","not organic"), organic= as.factor(organic))

tmp <- order_products %>%
  left_join(products, by="product_id") %>%
  group_by(organic) %>%
  summarize(count = n()) %>%
  mutate(proportion = count/sum(count))
kable(tmp)

tmp %>%
  ggplot(aes(x=organic,y=count, fill=organic))+
  geom_bar(stat="identity")
```

### Reordering Organic vs Non-Organic

People more often reorder organic products vs non-organic products.

```{r fig.width=4}
tmp <- order_products %>%
  left_join(products,by="product_id") %>%
  group_by(organic) %>%
  summarize(mean_reordered = mean(reordered))
kable(tmp)

tmp %>%
  ggplot(aes(x=organic,fill=organic,y=mean_reordered))+geom_bar(stat="identity")
```

### Visualizing the Product Portfolio

Here I use the treemap package to visualize the structure of Instacart's product portfolio. In total there are 21 departments containing 134 aisles.

```{r}
library(treemap)

tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n())
tmp <- tmp %>% left_join(departments,by="department_id")
tmp <- tmp %>% left_join(aisles,by="aisle_id")

tmp2<-order_products %>%
  group_by(product_id) %>%
  summarize(count=n()) %>%
  left_join(products,by="product_id") %>%
  ungroup() %>%
  group_by(department_id,aisle_id) %>%
  summarize(sumcount = sum(count)) %>%
  left_join(tmp, by = c("department_id", "aisle_id")) %>%
  mutate(onesize = 1)
```

#### How are aisles organized within departments?
```{r, fig.width=9, fig.height=6}
treemap(tmp2,index=c("department","aisle"),vSize="onesize",vColor="department",palette="Set3",title="",sortID="-sumcount", border.col="#FFFFFF",type="categorical", fontsize.legend = 0,bg.labels = "#FFFFFF")
```

#### How many unique products are offered in each department/aisle?

The size of the boxes shows the number of products in each category.

```{r, fig.width=9, fig.height=6}
treemap(tmp,index=c("department","aisle"),vSize="n",title="",palette="Set3",border.col="#FFFFFF")
```

#### How often are products from the department/aisle sold?

The size of the boxes shows the number of sales.

```{r, fig.width=9, fig.height=6}
treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF")
```

### Exploring Customer Habits

Here I look for customers who just reorder the same products again all the time. To find those I look at all orders (excluding the first order) where the percentage of reordered items is exactly 1 (this can easily be adapted to look at more lenient thresholds). We can see there are in fact **3,487** customers who always just reorder products.

#### Customers reordering only

```{r}
tmp <- order_products_prior %>%
  group_by(order_id) %>%
  summarize(m = mean(reordered),n=n()) %>%
  right_join(filter(orders,order_number>2), by="order_id")

tmp2 <- tmp %>%
  filter(eval_set =="prior") %>%
  group_by(user_id) %>%
  summarize(n_equal = sum(m==1,na.rm=T), percent_equal = n_equal/n()) %>%
  filter(percent_equal == 1) %>%
  arrange(desc(n_equal))

datatable(tmp2, class="table-condensed", style="bootstrap", options = list(dom = 'tp'))
```

#### The customer with the strongest habit

The coolest customer is id #99753, having 97 orders with only reordered items. That's what I call a strong habit.
She/he seems to like Organic Milk :-)

```{r warning=FALSE}
uniqueorders <- filter(tmp, user_id == 99753)$order_id
tmp <- order_products_prior %>%
  filter(order_id %in% uniqueorders) %>%
  left_join(products, by="product_id")

datatable(select(tmp,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 'tp'))
```

<br>
Let's look at his order in the train set. One would assume that he would buy "Organic Whole Milk" and "Organic Reduced Fat Milk":

```{r warning=FALSE}
tmp <- orders %>% filter(user_id==99753, eval_set == "train")
tmp2 <- order_products %>%
  filter(order_id == tmp$order_id) %>%
  left_join(products, by="product_id")

datatable(select(tmp2,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 't'))
```

**Tadaaaa. Prediction 100% correct.**
<br><br>
**Thank you all for the nice comments and upvotes. You are great.**