# Hello, Docker

At first glance, the subject is not simple and it takes some time to understand and grasp the package. To complicate matters, it also deals with Docker. Shit, I know nothing about Docker, captain. Well, it’s time to roll up those sleeves and get started, matey! Oh, I forgot, the objective is to create a data pipeline, i.e. a basic ELT (Extract-Load-Transform) but extensive enough to deal with the topic in greater depth.

# The purpose

For an analysis, I need to extract product data (promotions, prices, stock, etc.) from an online food and grocery site and build a competitive-analysis dashboard. This is a practice that Colruyt, a Belgian supermarket chain, has long used to guarantee the lowest prices: whether in a brochure or online, prices are recorded daily, analysed and compared. Note that I will not share the scraping code, in order to preserve the peace of mind of the company’s servers, especially as it is not the subject of this article.

I made a small representation of the flow in Figma:

I invite you to go to the package’s site and check out a few pages before continuing this article. Review the Use Cases including “Scheduled R scripts in the Cloud” to give you an intuition of the flexibility of the GCP service and the possibilities. If you are comfortable with all of the content presented, don’t waste your time reading this article, but if you want to take the time to learn about the features, process and points of attention, then keep reading. Make yourself comfortable. You are my guests.

After a first look at the package’s site, you may have noticed that it has 3 tabs:

• Cloud Build
• Cloud Run
• Cloud Scheduler

These are the 3 pillars of this package dedicated to the 3 GCP services.

Cloud Build allows you to run a build and produce artifacts, i.e. generate a Docker image by executing tasks or scripts in steps. The advantage is that you can also run an R script, as long as the Docker image contains a version of R. Artifacts are the elements produced by the code in question, such as a table or a .csv file.

Cloud Run allows you to deploy an image in a container without worrying about the infrastructure. Useful if you want to make an API or deploy a Shiny application.

Cloud Scheduler allows you to configure CRON jobs with CRON syntax to run jobs via HTTP or Pub/Sub.

For this case study, we will use the Build and Scheduler services. We will leave Run for another article (or not :-) ).

# Setup

Mark has made a long video detailing the setup steps. This is an essential step if you want to take advantage of the package and Google services. If you don’t have a GCP account yet, you can create one. The advantage is that you get $300 as a start-up offer in addition to the free tier for each service. For example, you get 120 minutes of free build time per day. That’s enough to have fun already. Below is a summary of the setup process:

1. Open a new project on GCP
2. Create a consent screen
3. Give the app a name
4. Activate the ../auth/cloud-platform API
5. Create a user email
6. Create an OAuth ID in .json (Desktop App)
7. Enable the Services APIs
8. Run the Setup Wizard
9. Launch a test

Once everything is properly configured and working, we can move on to the next step.

# Docker: it works

Cloud Build uses the Docker engine to execute the compilation steps (of your build) in containers. Docker is therefore an essential part of the process. The idea of Docker is simple: it lets you run applications and their dependencies in containers, in isolation from the host environment and from other containers. This guarantees reliable and predictable execution of your application on a wide variety of host servers. Arben Kqiku details the subject in his article, and you can find other reference articles in the package documentation. Arben reviews:

• GCP configuration
• The configuration of BigQuery
• Creating a local Docker image
• Running the image locally with RStudio Server
• Configuring {googleCloudRunner}
• Running a cloud build with the image
• Creating a CRON task

Although the article is comprehensive, it does not address some of the features of the {googleCloudRunner} package, which we will do here. To structure the study, we will use the official Google documentation. But first, let’s test Docker.
# RStudio in a container

To understand what’s going on, and for the rest of the tutorial, we will, as in Arben’s article, run an existing container image provided by the Rocker Project. Don’t forget to install Docker first. If you’re a Mac user like me, the installation is explained here.

docker run --rm -p 8787:8787 -e PASSWORD=12345 rocker/rstudio

Congratulations, you have just deployed your first container. To detail what happened: you used the run command to create a container as a layer on top of the rocker/rstudio image, which you didn’t have locally and which was therefore downloaded first. A number of options exist which I will not go into; please refer to the official documentation for more details. The options in this example are:

--rm : delete the container on exit
-p : map the container ports to the host
-e : specify environment variables

# Dockerfile

You can also create a new image based on an existing container image. For this we need a Dockerfile, with no extension. The Dockerfile gives the instructions needed to create your image. Arben suggests giving the Dockerfile a .txt extension, which is a mistake: you should not add an extension or you will get an error. Here is the example he gives:

FROM rocker/tidyverse:latest
ENV PASSWORD=123
ENV PORT=8787
RUN R -e "install.packages('bigrquery', repos = 'http://cran.us.r-project.org')"
ADD docker-tutorial-service.json /home/rstudio
ADD big-query-tutorial.R /home/rstudio
CMD Rscript /home/rstudio/big-query-tutorial.R

We tell Docker to build the new image from the rocker/tidyverse image, set the PASSWORD and PORT variables in the environment, run the expression with R (-e indicates to run the expression and then exit R), which installs the {bigrquery} package, copy the .json file and the .R script to the /home/rstudio folder in the container, and then run that same script with the Rscript command.
To build the image, simply type the following build command from the folder where your Dockerfile is located:

docker build -t docker-tutorial -f Dockerfile .

The -t option tags the image with the name docker-tutorial. This new image can then be referred to by name (easier) and used to generate a new container. The . indicates that the build context is the folder you are in. As you can see, it’s quite simple. To create the container and run your script in the process, you just need to run the following command, as seen above:

docker run --rm docker-tutorial

The advantage of Docker Hub is that you can share your image with the world and be pretty sure that your script, your machine learning process, your Shiny app or your pipeline will run correctly. BAM!

# Cloud Build

Now that you understand the principle, we can move on to Cloud Build. Rather than building your image locally, we’ll build it in the cloud. There are 2 ways to build a Docker image with Cloud Build (actually 3, the third using Cloud Native Buildpacks, but we won’t dwell on that one):

1. With a Dockerfile, as we saw earlier.
2. With a build config file.

## 1. With a Dockerfile

To run the build of your container image in the cloud, you can use the gcloud tool and run this command from the directory where your Dockerfile is located, PROJECT_ID being the ID of your GCP project and IMAGE_NAME the name you want to give your image:

gcloud builds submit --tag gcr.io/PROJECT_ID/IMAGE_NAME:tag

From R, with {googleCloudRunner}, we will use the cr_deploy_docker() function:

cr_deploy_docker(local = "path_to_dockerfile_folder", image_name = "gcr.io/PROJECT_ID/IMAGE_NAME")

You can see how simple it is. To make the example concrete, we will use the example from the build documentation, from RStudio:

• Creating a folder
• Creating the Dockerfile

FROM alpine
COPY quickstart.sh /
CMD ["sh", "/quickstart.sh"]

• Creating the script

echo "Hello, R! The time is $(date)."

• Running the cr_deploy_docker() command
cr_deploy_docker(local = "docker-basic-1/", image_name = "docker-basic-image", tag = "video")


## 2. With a build configuration file

This second method consists of writing a YAML file describing your build configuration as a series of steps. This is the Cloud Build equivalent of docker compose. Here is what a build file looks like:

steps:
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', 'gcr.io/PROJECT_ID/docker-basic-image:video', '.' ]
images:
- 'gcr.io/PROJECT_ID/docker-basic-image:video'


Then run the following command with gcloud:

gcloud builds submit --config your_config_file.yaml


This is where the {googleCloudRunner} package offers a series of functions that will help us create this YAML configuration file.

{googleCloudRunner} offers different families of functions. Among these families, 3 will be useful to us quite quickly.

### 1. cr_deploy_*

This family lets you deploy applications quickly without calling the cr_build_* or cr_buildstep_* functions yourself; it is a family of shortcuts. For example, as we saw above, cr_deploy_docker() deploys an application directly from a Dockerfile. There are other functions you may find useful too, such as the cr_deploy_pkgdown() variation. Then, of course, there is cr_deploy_r() to deploy an image that runs an R script. This function also has an argument to add a CRON task with Cloud Scheduler. The documentation is quite complete, so I invite you to refer to it to see all the function’s arguments. If you only want to run a script in the cloud on a CRON schedule, this function is more than enough.
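To give an idea of how little is needed, a cr_deploy_r() call could look like the sketch below. The script name, schedule and base image are hypothetical placeholders, not values from this project:

```r
library(googleCloudRunner)

# Hypothetical sketch: run my_script.R in the cloud every day at 06:00.
# "my_script.R" and the schedule are placeholders; r_image picks the
# rocker image the script will run on.
cr_deploy_r(
  "my_script.R",
  schedule = "0 6 * * *",
  r_image  = "rocker/tidyverse"
)
```

One function call covers the build, the image and the Cloud Scheduler job, which is why it is enough for the simple "scheduled R script" use case.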

To top it all off, Mark added a cr_deploy_gadget() function that launches a Shiny Gadget to configure a build of your script on a rocker image of your choice.

### 2. cr_build_*

This family provides all the functions to create and manipulate your build. cr_build() takes your YAML configuration file as its main argument. You will also find functions such as cr_build_yaml() to assemble your YAML file from steps, cr_build_artifacts() to download your build artifacts (the files generated by your build: tables, scripts, plot images, etc.), and cr_build_upload_gcs() to upload the files needed to run your script by creating a StorageSource on Google Cloud Storage.
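To show how these functions chain together, here is a minimal hypothetical sequence (the config file name is a placeholder): run a build from a config file, wait for it to finish, then pull its artifacts down locally.

```r
library(googleCloudRunner)

# Hypothetical sketch: launch a build from a config file, block until
# it completes, then download the artifacts it declared.
b <- cr_build("cloudbuild.yaml")
built <- cr_build_wait(b)
cr_build_artifacts(built)
```

We will use this launch-then-fetch pattern throughout the pipeline below.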

### 3. cr_buildstep_*

This family mainly provides functions to define the steps of your build in your build config (remember, the YAML config file). However, you don’t actually need to save these steps to a YAML file: the cr_build_yaml() function takes the steps defined by the cr_buildstep_* functions as arguments, as in this example:

my_config_yaml <- cr_build_yaml(
  steps = c(
    cr_buildstep("docker", c("build", "-t", image, ".")),
    cr_buildstep("docker", c("push", image)),
    cr_buildstep("gcloud", c("beta", "run", "deploy", "test1", "--image", image))
  ),
  images = image
)

cr_build(my_config_yaml)



Simple, isn’t it? If we look at the output of my_config_yaml, we see this :

my_config_yaml

#> ==cloudRunnerYaml==
#> steps:
#> - name: gcr.io/cloud-builders/docker
#>   args:
#>   - build
#>   - -t
#>   - gcr.io/my-project/my-image
#>   - '.'
#> - name: gcr.io/cloud-builders/docker
#>   args:
#>   - push
#>   - gcr.io/my-project/my-image
#> - name: gcr.io/cloud-builders/gcloud
#>   args:
#>   - beta
#>   - run
#>   - deploy
#>   - test1
#>   - --image
#>   - gcr.io/my-project/my-image
#> images:
#> - gcr.io/my-project/my-image


## Store the images in the Container Registry

The image created will be stored via step 2, which contains the docker push command added to the config with cr_buildstep("docker", c("push", image)). The image will land in your Container Registry unless you specify a path to the Artifact Registry (an evolution of the Container Registry). There are some differences between the two registries, but in most cases the Container Registry is sufficient; you can see the differences in this Stack Overflow post. The images argument of cr_build_yaml() will display the image in the build results. This includes the Build Description page of a build in the Cloud Console, the results of Build.get() and the results of gcloud builds list. However, if you only use the docker push command to store the compiled image, it will not be displayed in the build results. I must admit I didn’t understand the point of using both; in my opinion, using the images argument is sufficient.

I think it’s a good idea to use the images argument to keep the image and then run the application with Cloud Run. It can be a Shiny application or an API with Plumber.
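For that scenario, {googleCloudRunner} also has a shortcut. A hedged sketch, assuming a folder containing a plumber API (api.R) and its Dockerfile; the folder name is made up:

```r
library(googleCloudRunner)

# Hypothetical: "my-api-folder/" holds api.R and a Dockerfile; this
# builds the image and deploys it to Cloud Run in one call.
cr_deploy_plumber("my-api-folder/")
```

We leave Cloud Run aside here, but this shows how the kept image feeds naturally into a deployment.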

## Recovering artifacts

As mentioned above, artifacts are the files produced by your build. You can choose to keep them in GCS or a private repo. To do this, you need to add a step to your build with the cr_build_yaml_artifact() function:

cr_build_yaml_artifact(paths, bucket_dir = NULL, bucket = cr_bucket_get())


This function is passed as the value of the artifacts argument of the build function, not as one of the steps.

cr_build_yaml(
  steps = cr_buildstep_r(r),
  artifacts = cr_build_yaml_artifact('artifact.csv', bucket = "my-bucket")
)


Thanks to this addition, the data output by your script will be stored. If you have several files, you can use wildcard names:

cr_build_yaml(
  steps = cr_buildstep_r(r),
  artifacts = cr_build_yaml_artifact('*.csv', bucket = "my-bucket")
)


It is also possible to use substitution variables.

A build can also retrieve source files included alongside your Docker folder. These can be data or functions that you want to be able to run from your container. To do this, the cr_build_upload_gcs() function will copy the files from the specified folder to GCS. At build time, the files will be retrieved from GCS.

my_gcs_source <- cr_build_upload_gcs("my_folder")
build1 <- cr_build("cloudbuild.yaml", source = my_gcs_source)


Source files can also come from Cloud Source Repositories (Google’s git repo).

# Our data pipeline

## Scraping

We have all the elements to build our data pipeline or ELT (as you like). Here are the steps to follow:

### 1. Create our docker folder and script folder

First step, create your docker folder with the necessary files.

mkdir my-pipeline
mkdir script


### 2. Building your image with Dockerfile

A very simple step. We take the rocker/tidyverse image and add some packages. Just create a text file in RStudio, write our Docker instructions in it, then save the file as Dockerfile (no extension).

FROM rocker/tidyverse:latest

RUN R -e "install.packages(c('bigrquery', 'furrr', 'janitor', 'rvest', 'httr', 'jsonlite', 'logr'))"


You will notice that I install the following packages for the project: {furrr}, {janitor}, {rvest}, {httr}, {jsonlite}, {logr}. Now we just have to launch the build; the image will then be available in the Container Registry.

cr_deploy_docker(local = "my-pipeline/", image_name = "eshop", tag = "demo")


### 3. Including our script

To extract data from the online eshop, I wrote a get_eshop_products_f() function using the {rvest} package. As explained, I’m not going to detail the function here, but rather the script that will be run on each build and that calls the function. The script is saved in the my-pipeline directory created earlier for the Dockerfile. The function, on the other hand, is saved in the script directory. Why? Because we will create a source on GCS from the local script folder, and the files in that folder will be copied from the source into the container, under a deploy folder. The function itself is loaded via source(). I also adapted the function to scrape in parallel and save time (hence the plan(multisession)). I’m not convinced it works at build time on GCP, though; in any case, I don’t see any speed difference during scraping. Maybe it is related to the number of cores of the build’s virtual machine?

I also apply a modification of the column names with clean_names() from {janitor}. Finally, I save my future artifact as an .RDS file with the prefix last_products_ and the date in DDMMYYYY_HHMMSS format. We will have a file of type last_products_22042021_220000.rds.

storage <- cr_build_upload_gcs(local = "script")

library(tidyverse)
library(furrr)

source("deploy/script/get_eshop_products_f.R")

plan(multisession)
products <- get_eshop_products_f()
p1 <- products %>% select(!where(is.list)) %>% rowid_to_column() %>% janitor::clean_names()

saveRDS(p1, paste0("last_products_", format(Sys.time(), "%d%m%Y_%H%M%S"), ".rds"))



### 4. Creating the steps for the build

Now we will configure our build. We could do without this step if we didn’t need to retrieve the artifacts. But first let’s save the path to the image in a vector.

eshop_img <- "gcr.io/cloudrun-308213/eshop:demo"


Then write the steps with the cr_buildstep_* functions. Don’t forget to include the retrieval of the artifacts into your project’s bucket folder. We use * to match the saved .RDS file regardless of its date suffix.

my_config <- cr_build_yaml(
  steps = c(cr_buildstep_r("my-pipeline/my_script.R", name = eshop_img)),
  artifacts = cr_build_yaml_artifact("last_products_*.rds", bucket_dir = "eshop/"),
  timeout = 600
)


### 5. Creating a build object

To configure the CRON task, we create a build object from the build config and our Cloud Storage source object. (The launch_browser = TRUE argument of cr_build() opens the browser on the build’s live page on GCP.)

scrape_eshop <- cr_build_make(my_config, source = storage)


### 6. Launching the build

That’s it! It’s time to launch the build and see if everything works properly.

cr_build(scrape_eshop)


### 7. Creating the CRON task with Cloud Scheduler

To feed our pipeline, we want to launch our build every hour, giving us an inventory snapshot every hour. To achieve this we need to create an HTTP endpoint for Cloud Scheduler with the cr_build_schedule_http() function.

scrape_eshop_http <- cr_build_schedule_http(scrape_eshop)


Once your endpoint is done, all you have to do is schedule the frequency of the build. You will have to write your cron job with the specific notation. If you don’t know how this works, go to crontab.guru, where you can practice CRON notation. In our case, it is written 0 */1 * * *: we run the build every hour, at minute 0.
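For reference, a few unix-cron expressions in the five-field notation (minute, hour, day of month, month, day of week):

```
# minute  hour  day-of-month  month  day-of-week
0 */1 * * *     # every hour, at minute 0 (our case)
15 0/12 * * *   # at 00:15 and 12:15, i.e. every 12 hours
30 9 * * 1-5    # at 09:30 on weekdays
```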

cr_schedule("0 */1 * * *", name="get_eshop_products", httpTarget = scrape_eshop_http, overwrite = TRUE)


The task is scheduled! If you want to unschedule it, just go to the cloud console or simply run:

cr_schedule_delete("get_eshop_products")


Again, the representation of the flow :

This is the first part. We have learned how to:

• use Docker
• use the {googleCloudRunner} package
• create an image with a Dockerfile
• create a build from that image containing our script
• use a data source for a build
• retrieve the artifacts
• schedule a build

# Introduction

## The strategy

To have a complete table, updated every 12 hours, from all the tables created, I had to think about a strategy. This is of course only one way of doing it; if you have a better approach, let me know. Let’s summarise the situation: every hour we get a new table with the store’s current stock. The tables accumulate, and every 12 hours we create a new complete table, which we will call last_products, from the previous complete table plus the 12 new tables, and so on.
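The rolling-merge idea can be sketched with toy data; anti_join() is what isolates the tables that appeared since the last complete build (the file names here are made up):

```r
library(dplyr)

# Toy sketch of the strategy: the index from the previous run, a fresh
# listing, and the difference, i.e. the new hourly tables to append.
last_index <- tibble(name = c("t1.rds", "t2.rds"))
new_index  <- tibble(name = c("t1.rds", "t2.rds", "t3.rds"))

diff_index <- anti_join(new_index, last_index, by = "name")
diff_index$name
#> [1] "t3.rds"
```

Only the rows in diff_index need to be downloaded and appended, which keeps each 12-hourly build cheap.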

## Preparation

An init step is necessary, and it will be manual. It consists of generating the starting table and the starting index (a listing of all the existing tables). We will build, locally, a table called last_products from all the tables already present in the cloud. Then we’ll upload last_products and last_index into our eshop bucket directory.

The gcs_list_objects() function comes from the {googleCloudStorageR} package; it lists all the available files. I arrange the table by date, keep the necessary files (those whose names contain last_products_) with str_detect(), and finally add a filename column for the local storage path. For this to work, the storage folder must exist in the working directory.

last_index <- gcs_list_objects(prefix = "eshop", delimiter = "json") %>%
  arrange(updated) %>%
  filter(str_detect(name, "last_products_.")) %>%
  mutate(filename = paste0("storage/", str_sub(name, 10)))

last_index %>%
  mutate(download = walk2(name, filename, ~ gcs_get_object(.x, saveToDisk = .y, overwrite = FALSE)))


At the time of writing, there are over 1,000 tables, so downloading them all will take some time. I then bind all the .RDS files into one table with map_df(). The resulting last_products file is over 700 MB, which is why the process should be done from the beginning (but that’s another story).

last_products <- last_index %>%
  pull(filename) %>%
  map_df(readRDS)

All that remains is to save the 2 files locally and then upload them to GCS with gcs_upload().

saveRDS(last_index, "last_index.rds")
saveRDS(last_products, "last_products.rds")

gcs_upload(file = "last_index.rds", name = "eshop/last_index.rds")
gcs_upload(file = "last_products.rds", name = "eshop/last_products.rds")



## Process

Now we will detail the script to update the last_products table on each build launch on Cloud Build.

### 1. Transfer the source files to the build

Here I am transferring my 2 files last_products and last_index from GCS.

gcs_get_object(object_name = "gs://cloudrunner_bucket_2/eshop/last_index.rds", saveToDisk = "last_index.rds", overwrite = TRUE)
gcs_get_object(object_name = "gs://cloudrunner_bucket_2/eshop/last_products.rds", saveToDisk = "last_products.rds", overwrite = TRUE)


### 2. Load the index

I load the index into the open R session.

last_index <- readRDS("last_index.rds")


### 3. Create a new index

I create a new index with all the existing files.

new_index <- gcs_list_objects(prefix = "eshop", delimiter = "json") %>%
  arrange(updated) %>%
  filter(str_detect(name, "last_products_.")) %>%
  mutate(filename = str_sub(name, 10))


### 4. Keeping the difference

I keep in a table the difference between the 2 indexes with the anti_join() function.

diff_index <- new_index %>% anti_join(last_index)


### 5. The new table with the new files

I get the new files from GCS. If the script runs every hour, there will only be one file downloaded per build. If it is every day, then it would be 24 files.

diff_index %>%
  mutate(download = walk2(name, filename, ~ gcs_get_object(.x, saveToDisk = .y, overwrite = TRUE)))


### 6. Create the new_products table

I create, as we saw above, a new table from the .RDS files downloaded locally in the build, then bind the 2 tables.

diff_products <- diff_index %>%
  pull(filename) %>%
  map_df(readRDS)

new_products <- last_products %>% bind_rows(diff_products)


### 7. Save the new table and index.

The trick lies in saving the files by overwriting the old ones.

saveRDS(new_index, "last_index.rds")
saveRDS(new_products, "last_products.rds")


### 8. Write the procedure in a script

The objective is to create a new build, so we need to put all these steps together in a script that will be called when each build is launched. I use the gcs_auth() function to grant the session the rights to retrieve the 2 files. Alternatively, as we saw earlier, I could pass to the source argument of cr_build_make(my_config, source = my_source) the object pointing to the data needed for the build, previously uploaded with cr_build_upload_gcs(local = "my_source"). In fact, that’s what I did with the auth.json in the script: it is used to authenticate the session with my credentials.

library(tidyverse)
library(googleCloudStorageR)

# Gcloud auth
gcs_auth("deploy/auth.json")
gcs_global_bucket("cloudrunner_bucket_2")

# transfer files
gcs_get_object(object_name = "gs://cloudrunner_bucket_2/eshop/last_index.rds", saveToDisk = "last_index.rds", overwrite = TRUE)
gcs_get_object(object_name = "gs://cloudrunner_bucket_2/eshop/last_products.rds", saveToDisk = "last_products.rds", overwrite = TRUE)

# load the previous index and table
last_index <- readRDS("last_index.rds")
last_products <- readRDS("last_products.rds")

# create new index
new_index <- gcs_list_objects(prefix = "eshop", delimiter = "json") %>%
  arrange(updated) %>%
  filter(str_detect(name, "last_products_.")) %>%
  mutate(filename = str_sub(name, 10))

# create diff index
diff_index <- new_index %>% anti_join(last_index)

# get the new files from the diff index
diff_index %>% mutate(download = walk2(name, filename, ~ gcs_get_object(.x, saveToDisk = .y, overwrite = TRUE)))

# merge the new files into one data frame
diff_products <- diff_index %>%
  pull(filename) %>%
  map_df(readRDS)

# merge the last_products object with the new data frame
new_products <- last_products %>% bind_rows(diff_products)

# save the new files with the right names to replace the old ones
# at the end of the build process (via artifacts)
saveRDS(new_index, "last_index.rds")
saveRDS(new_products, "last_products.rds")


### 9. Configure the build

All that remains is to write the build configuration. I first create a new docker-merge-table folder containing the script above. I target the base image to use (the same as in the first part). Then I define the YAML with cr_buildstep_r() and the path of the script, and, as the artifacts argument, the 2 output files of the script, which will be transferred to GCS and overwrite the old ones.

my_img <- "gcr.io/cloudrun-308213/eshop:demo"

# build process image
merge_config <- cr_build_yaml(
  steps = c(cr_buildstep_r("docker-merge-table/merge_script.R", name = my_img)),
  artifacts = cr_build_yaml_artifact(list("last_index.rds", "last_products.rds"), bucket_dir = "eshop/"),
  timeout = 600
)


### 10. Launch the build

In order to test if everything is working correctly, we just need to run this command. If everything goes well, i.e. the last_products file is updated correctly, we can move on to setting up a CRON task.

cr_build(merge_config, launch_browser = TRUE, source = auth_storage)


This graph illustrates the initialization phase and then the recurring merge process:

### 11. Setting up the CRON task

The last step is to make a build object with cr_build_make(), create the HTTP endpoint with cr_build_schedule_http(), and schedule it with cr_schedule().

# make a build object
merge_products <- cr_build_make(merge_config, source = auth_storage)

# cloud scheduler http endpoint
merge_products_http <- cr_build_schedule_http(merge_products)

# Launch cron
cr_schedule("15 0/12 * * *", name = "merge_products", httpTarget = merge_products_http)


In this second part, we learned how to:

• use a table-merge strategy to build a new complete table
• apply the same build process as in part 1

# Introduction

### 1. Build configuration

We find the two steps of our build configuration. The first copies the .Rmd file for our build from our bucket, using gsutil with the cp command. The second renders the .Rmd file to HTML for our flexdashboard. There is also the artifacts argument, as seen earlier.

rmd_config <- cr_build_yaml(
  steps = c(
    cr_buildstep("gsutil", args = c("cp", "gs://cloudrunner_bucket_2/eshop_exploration.Rmd", "eshop_exploration.Rmd")),
    cr_buildstep_r(r = "rmarkdown::render('eshop_exploration.Rmd', output_file = 'eshop_exploration.html')", name = my_img)
  ),
  artifacts = cr_build_yaml_artifact("eshop_exploration.html", bucket_dir = "eshop/"),
  timeout = 600
)


### 2. Launch the build

To check that the build is running correctly, we can run it manually with the build function.

cr_build(rmd_config, launch_browser = TRUE, source = storage)


### 3. Configure CRON Job

Make a build object with cr_build_make(), create the endpoint with cr_build_schedule_http(), and schedule it with cr_schedule().

# make a build object
build_rmd <- cr_build_make(rmd_config , source = storage)

# cloud scheduler http endpoint
build_rmd_http <- cr_build_schedule_http(build_rmd)

# Launch cron
cr_schedule("30 0/12 * * *", name = "build_rmd", httpTarget = build_rmd_http)


This build therefore runs every 12 hours at minute 30 (00:30 and 12:30), i.e. 15 minutes after the product-table consolidation step from part 2 of this post.

In this third part, we learned how to:

• use the cr_buildstep() function with gsutil
• apply the same build process as in parts 1 and 2
• build an automated pipeline