A blog post describing the first steps of data cleaning and analysis using R.
Note:
In this and future articles, you will see some arrows below the R code. If you click on one, it will display the Stata code equivalent to the R code shown. However, since these are two different programs, they are not completely equivalent and some of the Stata code may not fully correspond to the R code. Consider it a reference point to keep your bearings rather than an exact equivalent.
In this post, you will see how to import and clean data, compute descriptive statistics and make a few plots. I will also show you a personal method to organize one's work.
First of all, you need to create a project. In RStudio, click on "File", then "New Project", and choose the location and the name of the project. In the folder that contains the project, I have several sub-folders: Figures, Bases_Used, Bases_Created. To save or use files in these particular sub-folders, I use the here package. The command here() shows the path to your project, and you only need to complete that path to access your datasets or other files.

Why is this package important? Your code must be reproducible, either for your current collaborators to work efficiently with you or for other people to check your code and use it in the future. Using paths that work only on your computer (like "/home/Mr X/somefolder/somesubfolder/Project") makes your code longer and more annoying to use, since it requires manually changing paths to import data or other files. The here package makes your code much easier to reproduce since it automatically detects the path to your project: you only need to keep the same structure between R files and datasets. You will see in the next part how to use it.
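As a minimal sketch (assuming the here package is installed and the project contains a Bases_Used sub-folder, as above):

```r
library(here)

here()                                # path to the project root, detected automatically
here("Bases_Used", "Base_Excel.xlsx") # full path to a file inside a sub-folder
```

Because the path is built relative to the project root, the same call works on any machine that opens the project.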
We will use data contained in Excel (.xlsx) and text (.txt) files. You can find these files (and the full R script corresponding to this post) here. To import the Excel data, we will need the readxl package: we use its read_excel function to import the Excel files, and the read.table function (in base R) to import the text data:
base1 <- read_excel(here("Bases_Used/Base_Excel.xlsx"), sheet = "Base1")
base2 <- read_excel(here("Bases_Used/Base_Excel.xlsx"), sheet = "Base2")
base3 <- read_excel(here("Bases_Used/Base_Excel.xlsx"), sheet = "Base3")
base4 <- read.table(here("Bases_Used/Base_Text.txt"), header = TRUE)
cd "/path/to/Bases_Used"
import excel using Base_Excel, sheet("Base1") firstrow
As you can see, if your project is in a folder and your datasets are stored in the Bases_Used sub-folder, this code will work automatically since here detects the path. We have now stored the four datasets in four objects called data.frames. To me, this simple thing is an advantage over Stata, where storing multiple datasets at the same time is not intuitive at all.
We want a unique dataset to compute descriptive statistics and econometrics (we will only do descriptive statistics in this post). Therefore, we will merge these datasets together, first using the dplyr package. This package is one of the references for data manipulation. It is extremely useful and much easier to use than base R. You can find a cheatsheet (i.e. a recap of the main functions) for this package here, along with cheatsheets for many other great packages.
First, we want to regroup base1 and base2. To do so, we just need to put one under the other and "stick" them together with bind_rows, then observe the result:

base_created <- bind_rows(base1, base2)
base_created
# A tibble: 23 x 6
hhid indidy1 surname name gender wage
<dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 1 BROWN Robert 1 2000
2 1 2 JONES Michael 1 2100
3 1 3 MILLER William 1 2300
4 1 4 DAVIS David 1 1800
5 2 1 RODRIGUEZ Mary 2 3600
6 2 2 MARTINEZ Patricia 2 3500
7 2 3 WILSON Linda 2 1900
8 2 4 ANDERSON Richard 1 1900
9 3 1 THOMAS Charles 1 1800
10 3 2 TAYLOR Barbara 2 1890
# … with 13 more rows
*** Open base #2 and bind the rows
preserve
clear all
import excel using Base_Excel, sheet("Base2") firstrow
tempfile base2
save `base2'
restore
append using `base2'
As you can see, we obtain a dataframe with 6 columns (like each table separately) and 23 rows: 18 from the first table and 5 from the second. Now, we merge this dataframe with base3. base_created and base3 only have one column in common (hhid), so we will need to specify that we want to merge these two bases by this column:
base_created <- left_join(base_created, base3, by = "hhid")
base_created
# A tibble: 23 x 7
hhid indidy1 surname name gender wage location
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 1 1 BROWN Robert 1 2000 France
2 1 2 JONES Michael 1 2100 France
3 1 3 MILLER William 1 2300 France
4 1 4 DAVIS David 1 1800 France
5 2 1 RODRIGUEZ Mary 2 3600 England
6 2 2 MARTINEZ Patricia 2 3500 England
7 2 3 WILSON Linda 2 1900 England
8 2 4 ANDERSON Richard 1 1900 England
9 3 1 THOMAS Charles 1 1800 Spain
10 3 2 TAYLOR Barbara 2 1890 Spain
# … with 13 more rows
*** Open base #3 and merge
preserve
clear all
cd ..\Bases_Used
import excel using Base_Excel, sheet("Base3") firstrow
tempfile base3
save `base3'
restore
merge m:1 hhid using `base3'
drop _merge
left_join is a dplyr function saying that the first dataframe mentioned (here base_created) is the "most important" one and that we stick the second one (here base3) to it. If there are more rows in the first one than in the second one, there will be some missing values, but the number of rows will stay the same. If we knew that base3 had more rows than base_created, we would have used right_join instead.
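To make the row-preserving behaviour concrete, here is a tiny made-up example (the data frames below are illustrative, not the post's datasets):

```r
library(dplyr)

households <- tibble(hhid = c(1, 1, 2), indid = c(1, 2, 1))
locations  <- tibble(hhid = 1, location = "France")  # no row for household 2

# all 3 rows of `households` are kept; location is NA for hhid 2
left_join(households, locations, by = "hhid")
```

With right_join the roles would be reversed: every row of the second data frame would be kept instead.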
We now want to merge base_created with base4. The problem is that there are no common columns, so we will need to create one in each. Moreover, base_created contains data for the year 2019 and base4 for the year 2020, so we will also need to create columns specifying the year:
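The R code for this step is not displayed here; judging from the Stata equivalent below (rename followed by gen year=...), it presumably uses rename and mutate. A self-contained sketch, with small stand-in data frames in place of the real base_created and base4:

```r
library(dplyr)

# toy stand-ins for base_created and base4 (the real ones come from the files above)
base_created <- tibble(hhid = 1, indidy1 = 1:2, wage = c(2000, 2100))
base4        <- tibble(hhid = 1, indidy2 = 1:2, wage = c(2136, 2362))

# give both data frames the same ID column name and add the year column
base_created <- base_created %>% rename(indid = indidy1) %>% mutate(year = 2019)
base4        <- base4        %>% rename(indid = indidy2) %>% mutate(year = 2020)
```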
From this point, we can merge these two dataframes:
base_created2 <- bind_rows(base_created, base4)
base_created2
# A tibble: 46 x 8
hhid indid surname name gender wage location year
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 1 1 BROWN Robert 1 2000 France 2019
2 1 2 JONES Michael 1 2100 France 2019
3 1 3 MILLER William 1 2300 France 2019
4 1 4 DAVIS David 1 1800 France 2019
5 2 1 RODRIGUEZ Mary 2 3600 England 2019
6 2 2 MARTINEZ Patricia 2 3500 England 2019
7 2 3 WILSON Linda 2 1900 England 2019
8 2 4 ANDERSON Richard 1 1900 England 2019
9 3 1 THOMAS Charles 1 1800 Spain 2019
10 3 2 TAYLOR Barbara 2 1890 Spain 2019
# … with 36 more rows
rename indidy1 indid
gen year=2019
*** Open base #4 and merge
preserve
clear all
import delimited Base_Text.txt
rename indidy2 indid
gen year=2020
tempfile base4
save `base4'
restore
merge 1:1 hhid indid year using `base4'
drop _merge
But the new rows contain many missing values because base4 only had three columns. We want the data frame arranged by household, then by individual, and finally by year. Using only dplyr functions, we can do:
base_created2 <- base_created2 %>%
group_by(hhid, indid) %>%
arrange(hhid, indid, year) %>%
ungroup()
base_created2
# A tibble: 46 x 8
hhid indid surname name gender wage location year
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 1 1 BROWN Robert 1 2000 France 2019
2 1 1 <NA> <NA> NA 2136 <NA> 2020
3 1 2 JONES Michael 1 2100 France 2019
4 1 2 <NA> <NA> NA 2362 <NA> 2020
5 1 3 MILLER William 1 2300 France 2019
6 1 3 <NA> <NA> NA 2384 <NA> 2020
7 1 4 DAVIS David 1 1800 France 2019
8 1 4 <NA> <NA> NA 2090 <NA> 2020
9 2 1 RODRIGUEZ Mary 2 3600 England 2019
10 2 1 <NA> <NA> NA 3784 <NA> 2020
# … with 36 more rows
Notice the %>% between the lines: this is a pipe, and its role is to chain lines of code so that we don't have to write base_created2 every time. Now that our dataframe is arranged, we need to fill in the missing values. Fortunately, these values do not change over time for a given individual, since they concern the gender, the location, the name and the surname. So basically, we can just take the value of the cell above (corresponding to year 2019) and replicate it in each cell corresponding to year 2020:

base_created2 <- base_created2 %>%
  fill(select_if(., ~ any(is.na(.))) %>% names(), .direction = 'down')
foreach x of varlist surname name gender location {
bysort hhid indid: replace `x'=`x'[_n-1] if year==2020
}
Let me explain the code above:

- fill aims to fill cells (it comes from the tidyr package)
- select_if selects columns according to the condition defined
- any(is.na(.)) is a logical question asking if there are missing values (NA)
- . indicates that we want to apply the function to the whole dataframe
- names tells us what the names of the selected columns are
- .direction tells the direction in which the filling goes

So fill(select_if(., ~ any(is.na(.))) %>% names(), .direction = 'down') means that, for the dataframe, we select each column which has some NA in it and obtain their names. In these columns, the empty cells are filled with the value of the cell above (since the direction is "down").
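On a tiny invented panel (not the post's data), the filling step works like this:

```r
library(dplyr)
library(tidyr)

# two individuals, two years; the 2020 rows miss the time-invariant variable
df <- tibble(
  indid = c(1, 1, 2, 2),
  year  = c(2019, 2020, 2019, 2020),
  name  = c("Robert", NA, "Mary", NA),
  wage  = c(2000, 2136, 3600, 3784)
)

# each NA name is replaced by the value of the cell above it
df %>% fill(select_if(., ~ any(is.na(.))) %>% names(), .direction = "down")
```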
Finally, we want the first three columns to be hhid, indid and year, and we create an ID column named hhind, which is just the union of hhid and indid:
base_created2 <- base_created2 %>%
select(hhid, indid, year, everything()) %>%
unite(hhind, c(hhid, indid), sep = "", remove = FALSE)
base_created2
# A tibble: 46 x 9
hhind hhid indid year surname name gender wage location
<chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 11 1 1 2019 BROWN Robert 1 2000 France
2 11 1 1 2020 BROWN Robert 1 2136 France
3 12 1 2 2019 JONES Michael 1 2100 France
4 12 1 2 2020 JONES Michael 1 2362 France
5 13 1 3 2019 MILLER William 1 2300 France
6 13 1 3 2020 MILLER William 1 2384 France
7 14 1 4 2019 DAVIS David 1 1800 France
8 14 1 4 2020 DAVIS David 1 2090 France
9 21 2 1 2019 RODRIGUEZ Mary 2 3600 England
10 21 2 1 2020 RODRIGUEZ Mary 2 3784 England
# … with 36 more rows
egen hhind=group(hhid indid)
order hhind hhid indid year *
sort hhid indid year
That’s it, we now have the complete dataframe.
There are still some things to do. First, we notice that there are some errors in the location column (England_error and Spain_error), so we correct them:
# display the unique values of the column "location"
unique(base_created2$location)
[1] "France" "England" "Spain" "Italy"
[5] "England_error" "Spain_error"
# correct the errors
base_created2[base_created2 == "England_error"] <- "England"
base_created2[base_created2 == "Spain_error"] <- "Spain"
unique(base_created2$location)
[1] "France" "England" "Spain" "Italy"
replace location="England" if location=="England_error"
replace location="Spain" if location=="Spain_error"
Basically, what we have done here is select every cell in the whole dataframe that had the value England_error (respectively Spain_error) and replace it with England (Spain). We also need to recode the gender column, because binary variables have to take values of 0 or 1, not 1 or 2.
base_created2$gender <- recode(base_created2$gender, `2` = 0)
label define genderlab 1 "M" 2 "F"
label values gender genderlab
recode gender (2=0 "Female") (1=1 "Male"), gen(gender2)
drop gender
rename gender2 gender
To have more details on the dataframe, we can create variable labels. To do so, we use the upData function from the Hmisc package.
library(Hmisc)
var.labels <- c(hhind = "individual's ID",
hhid = "household's ID",
indid = "individual's ID in the household",
year = "year",
surname = "surname",
name = "name",
gender = "1 if male, 0 if female",
wage = "wage",
location = "household's location")
base_created2 <- upData(base_created2, labels = var.labels)
label variable hhind "individual's ID"
label variable hhid "household's ID"
label variable indid "individual's ID in the household"
label variable year "year"
label variable surname "surname"
label variable name "name"
label variable gender "1 if male, 0 if female"
label variable wage "wage"
label variable location "household's location"
We can see the result with:
contents(base_created2)
Data frame:base_created2 46 observations and 9 variables Maximum # NAs:0
Labels Class Storage
hhind individual's ID character character
hhid household's ID integer integer
indid individual's ID in the household integer integer
year year integer integer
surname surname character character
name name character character
gender 1 if male, 0 if female integer integer
wage wage integer integer
location household's location character character
Now that our dataframe is clean and detailed, we can compute some descriptive statistics. But before doing it, we might want to save it:
library(openxlsx)  # write.xlsx() comes from the openxlsx package
write.xlsx(base_created2, file = here("Bases_Created/modified_base.xlsx"))
cd ..\Bases_Created
export excel using "modified_base.xls", replace
First of all, if we want to check the number of people per location or gender and per year, we use the table function:
table(base_created2$gender, base_created2$year)
2019 2020
0 9 9
1 14 14
table(base_created2$location, base_created2$year)
2019 2020
England 6 6
France 12 12
Italy 1 1
Spain 4 4
tab gender if year==2019
tab location if year==2019
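If you would rather have shares than counts, base R's prop.table (not used in the post) converts such a cross-table, sketched here on made-up gender and year vectors:

```r
# toy vectors standing in for the gender and year columns of the dataframe
gender <- c(0, 0, 1, 1, 1, 0)
year   <- c(2019, 2020, 2019, 2020, 2019, 2020)

counts <- table(gender, year)
prop.table(counts, margin = 2)  # shares within each year: columns sum to 1
```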
To have more detailed statistics, you can use many functions. Here, we use the describe function from the Hmisc package:
describe(base_created2)
base_created2
9 Variables 46 Observations
----------------------------------------------------------------------
hhind : individual's ID
n missing distinct
46 0 23
lowest : 11 12 13 14 21, highest: 71 72 81 82 83
----------------------------------------------------------------------
hhid : household's ID
n missing distinct Info Mean Gmd
46 0 8 0.975 4.217 2.783
lowest : 1 2 3 4 5, highest: 4 5 6 7 8
Value 1 2 3 4 5 6 7 8
Frequency 8 8 4 2 10 4 4 6
Proportion 0.174 0.174 0.087 0.043 0.217 0.087 0.087 0.130
----------------------------------------------------------------------
indid : individual's ID in the household
n missing distinct Info Mean Gmd
46 0 5 0.923 2.217 1.306
lowest : 1 2 3 4 5, highest: 1 2 3 4 5
Value 1 2 3 4 5
Frequency 16 14 8 6 2
Proportion 0.348 0.304 0.174 0.130 0.043
----------------------------------------------------------------------
year
n missing distinct Info Mean Gmd
46 0 2 0.75 2020 0.5111
Value 2019 2020
Frequency 23 23
Proportion 0.5 0.5
----------------------------------------------------------------------
surname
n missing distinct
46 0 23
lowest : ANDERSON BROWN DAVIS DOE JACKSON
highest: THOMAS THOMPSON WHITE WILLIAMS WILSON
----------------------------------------------------------------------
name
n missing distinct
46 0 23
lowest : Barbara Charles Daniel David Donald
highest: Richard Robert Susan Thomas William
----------------------------------------------------------------------
gender : 1 if male, 0 if female
n missing distinct Info Sum Mean Gmd
46 0 2 0.715 28 0.6087 0.487
----------------------------------------------------------------------
wage
n missing distinct Info Mean Gmd .05
46 0 37 0.998 2059 477.4 1627
.10 .25 .50 .75 .90 .95
1692 1800 1901 2098 2373 3575
lowest : 1397 1600 1608 1683 1690, highest: 2384 3500 3600 3782 3784
----------------------------------------------------------------------
location : household's location
n missing distinct
46 0 4
Value England France Italy Spain
Frequency 12 24 2 8
Proportion 0.261 0.522 0.043 0.174
----------------------------------------------------------------------
sum *, detail
but you can also try the summary function (automatically available in base R), stat.desc in pastecs, skim in skimr, or even makeDataReport in dataMaid to get a complete PDF report summarizing your data. To summarize data under certain conditions (e.g. to get the average wage for each location), you can use dplyr:
# you can change the argument in group_by() by gender for example
base_created2 %>%
group_by(location) %>%
summarize_at(.vars = "wage", .funs = "mean")
# A tibble: 4 x 2
location wage
<labelled> <dbl>
1 England 2452.
2 France 1935.
3 Italy 1801
4 Spain 1905.
tabstat wage if year==2019, stats(N mean sd min max p25 p50 p75) by(location)
tabstat wage if year==2020, stats(N mean sd min max p25 p50 p75) by(location)
Finally, we want to plot some data to include in our report or article (or anything else). ggplot2 is THE reference for making plots with R. The ggplot function does not create a graph by itself but declares the data you are going to use and the aesthetics (aes). Here, we want to display the wages in a histogram and to distinguish them per year; therefore, we fill the bars according to the year. To specify the type of graph we want, we add + geom_histogram() after ggplot. You may change the number of bins to get a more precise histogram.
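The R code for this first histogram is not shown in the extracted text; by analogy with the faceted version further down, it presumably looks like the sketch below (the name hist1 and bins = 10 are assumptions, and a toy data frame stands in for base_created2):

```r
library(ggplot2)

# toy stand-in for base_created2, with just the wage and year columns
toy <- data.frame(wage = c(2000, 2136, 2100, 2362, 3600, 3784),
                  year = rep(c(2019, 2020), 3))

hist1 <- ggplot(data = toy, mapping = aes(wage, fill = factor(year))) +
  geom_histogram(bins = 10)
hist1
```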
histogram wage if year==2019, saving(Hist1, replace) bin(10) freq title("Year 2019") ytitle("Frequency")
histogram wage if year==2020, saving(Hist2, replace) bin(10) freq title("Year 2020") ytitle("Frequency")
If you prefer one histogram per year, you can add the facet_wrap() function, as below.
hist2 <- ggplot(data = base_created2,
mapping = aes(wage, fill = factor(year))) +
geom_histogram(bins = 10) +
facet_wrap(vars(year))
hist2
graph combine Hist1.gph Hist2.gph, col(2) xsize(10) ysize(5) iscale(1.5) title("{bf:Wage distribution per year}")
Finally, you may want to export these graphs. To do so, we use ggsave (you can replace .pdf with .eps or .png if you want):
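A minimal sketch of the export step (the plot object and the output path are illustrative, not taken from the post; in the project itself you would build the path with here("Figures", ...)):

```r
library(ggplot2)

# a small plot to export
p <- ggplot(data.frame(wage = c(2000, 2100, 2300, 1800)), aes(wage)) +
  geom_histogram(bins = 10)

# ggsave() infers the format from the extension: .pdf, .eps or .png all work
out <- file.path(tempdir(), "Histogram1.pdf")
ggsave(out, plot = p)
```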
graph export Histogram1.pdf, replace
That’s it! In this first post, you have seen how to import, clean and tidy datasets, and how to make some descriptive statistics and some plots. I hope this was helpful to you!
This is my session info, so that you can see the versions of the packages used. This is useful if the results in my post are no longer reproducible because packages have changed. The packages marked with a star (*) are those explicitly called in the script.
─ Session info ─────────────────────────────────────────────────────
setting value
version R version 4.0.4 (2021-02-15)
os Ubuntu 18.04.5 LTS
system x86_64, linux-gnu
ui X11
language en
collate fr_FR.UTF-8
ctype fr_FR.UTF-8
tz Europe/Paris
date 2021-03-16
─ Packages ─────────────────────────────────────────────────────────
 package      * version date       lib source
 assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 backports      1.2.1   2020-12-09 [1] CRAN (R 4.0.3)
 base64enc      0.1-3   2015-07-28 [1] CRAN (R 4.0.0)
 bslib          0.2.4   2021-01-25 [1] CRAN (R 4.0.3)
 cellranger     1.1.0   2016-07-27 [1] CRAN (R 4.0.0)
 checkmate      2.0.0   2020-02-06 [1] CRAN (R 4.0.0)
 cli            2.3.1   2021-02-23 [1] CRAN (R 4.0.4)
 cluster        2.1.1   2021-02-14 [4] CRAN (R 4.0.3)
 colorspace     2.0-0   2020-11-11 [1] CRAN (R 4.0.3)
 crayon         1.4.1   2021-02-08 [1] CRAN (R 4.0.3)
 data.table     1.13.6  2020-12-30 [1] CRAN (R 4.0.3)
 DBI            1.1.1   2021-01-15 [1] CRAN (R 4.0.3)
 debugme        1.1.0   2017-10-22 [1] CRAN (R 4.0.2)
 digest         0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
 distill        1.2.2   2021-03-04 [1] local
 downlit        0.2.1   2020-11-04 [1] CRAN (R 4.0.3)
 dplyr        * 1.0.5   2021-03-05 [1] CRAN (R 4.0.4)
 ellipsis       0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
 evaluate       0.14    2019-05-28 [1] CRAN (R 4.0.0)
 fansi          0.4.2   2021-01-15 [1] CRAN (R 4.0.3)
 farver         2.0.3   2020-01-16 [1] CRAN (R 4.0.0)
 foreign        0.8-81  2020-12-22 [4] CRAN (R 4.0.3)
 Formula      * 1.2-4   2020-10-16 [1] CRAN (R 4.0.3)
 generics       0.1.0   2020-10-31 [1] CRAN (R 4.0.3)
 ggplot2      * 3.3.3   2020-12-30 [1] CRAN (R 4.0.3)
 glue           1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
 gridExtra      2.3     2017-09-09 [1] CRAN (R 4.0.0)
 gtable         0.3.0   2019-03-25 [1] CRAN (R 4.0.0)
 here         * 1.0.1   2020-12-13 [1] CRAN (R 4.0.3)
 highr          0.8     2019-03-20 [1] CRAN (R 4.0.0)
 Hmisc        * 4.4-2   2020-11-29 [1] CRAN (R 4.0.3)
 htmlTable      2.1.0   2020-09-16 [1] CRAN (R 4.0.3)
 htmltools      0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
 htmlwidgets    1.5.3   2020-12-10 [1] CRAN (R 4.0.3)
 jpeg           0.1-8.1 2019-10-24 [1] CRAN (R 4.0.0)
 jquerylib      0.1.3   2020-12-17 [1] CRAN (R 4.0.3)
 jsonlite       1.7.2   2020-12-09 [1] CRAN (R 4.0.3)
 knitr          1.31    2021-01-27 [1] CRAN (R 4.0.3)
 labeling       0.4.2   2020-10-20 [1] CRAN (R 4.0.3)
 lattice      * 0.20-41 2020-04-02 [4] CRAN (R 4.0.0)
 latticeExtra   0.6-29  2019-12-19 [1] CRAN (R 4.0.0)
 lifecycle      1.0.0   2021-02-15 [1] CRAN (R 4.0.4)
 magrittr       2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
 Matrix         1.3-2   2021-01-06 [4] CRAN (R 4.0.3)
 munsell        0.5.0   2018-06-12 [1] CRAN (R 4.0.0)
 nnet           7.3-15  2021-01-24 [4] CRAN (R 4.0.3)
 pillar         1.5.1   2021-03-05 [1] CRAN (R 4.0.4)
 pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
 png            0.1-7   2013-12-03 [1] CRAN (R 4.0.0)
 purrr          0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
 R6             2.5.0   2020-10-28 [1] CRAN (R 4.0.3)
 RColorBrewer   1.1-2   2014-12-07 [1] CRAN (R 4.0.0)
 Rcpp           1.0.6   2021-01-15 [1] CRAN (R 4.0.3)
 readxl       * 1.3.1   2019-03-13 [1] CRAN (R 4.0.0)
 rlang          0.4.10  2020-12-30 [1] CRAN (R 4.0.3)
 rmarkdown      2.6.6   2021-02-08 [1] Github (rstudio/rmarkdown@de0e2ec)
 rpart          4.1-15  2019-04-12 [4] CRAN (R 4.0.0)
 rprojroot      2.0.2   2020-11-15 [1] CRAN (R 4.0.3)
 rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.0.3)
 sass           0.3.1   2021-01-24 [1] CRAN (R 4.0.4)
 scales         1.1.1   2020-05-11 [1] CRAN (R 4.0.0)
 sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
 stringi        1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
 stringr        1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
 survival     * 3.2-7   2020-09-28 [4] CRAN (R 4.0.2)
 tibble         3.1.0   2021-02-25 [1] CRAN (R 4.0.4)
 tidyr        * 1.1.2   2020-08-27 [1] CRAN (R 4.0.2)
 tidyselect     1.1.0   2020-05-11 [1] CRAN (R 4.0.0)
 utf8           1.2.1   2021-03-12 [1] CRAN (R 4.0.4)
 vctrs          0.3.6   2020-12-17 [1] CRAN (R 4.0.3)
 withr          2.4.1   2021-01-26 [1] CRAN (R 4.0.3)
 xfun           0.20    2021-01-06 [1] CRAN (R 4.0.3)
 yaml           2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
[1] /home/etienne/R/x86_64-pc-linux-gnu-library/4.0
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/etiennebacher/personal_website_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Bacher (2020, Jan. 22). Etienne Bacher: First contact with the data on R. Retrieved from https://www.etiennebacher.com/posts/2020-01-22-first-contact/
BibTeX citation
@misc{bacher2020first,
  author = {Bacher, Etienne},
  title = {Etienne Bacher: First contact with the data on R},
  url = {https://www.etiennebacher.com/posts/2020-01-22-first-contact/},
  year = {2020}
}