Data Frame

Published onesixx on

 

  DataCamp Blog , https://www.r-bloggers.com/15-easy-solutions-to-your-data-frame-problems-in-r/ 번역

 

Data Frame?

 

class(warpbreaks)
str(warpbreaks);head(warpbreaks)
[1] "data.frame"


'data.frame':	54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...


  breaks wool tension
1     26    A       L
2     30    A       L
3     54    A       L
4     25    A       L
5     70    A       L
6     52    A       L

matrix와 비슷하지만,  각 column의 데이터 type이 같을 필요는 없다. 
list와 비슷하지만, 길이가 같은 벡터의 구조로, List의 특별한 형태라고 보면 된다.  쉽게 같은 길이의 하나의 spreadsheet.
https://www.youtube.com/watch?v=I1plgNz1bVI

 

 

The Basics Of Data Frames: The Questions And Solutions

간단한 dataFrame만들기

tutorial,  Rdocumentation page on read.table.

 

vectors를 먼저 만들고, data.frame()함수로  그 벡터을 조합한다. 
주의할 것은 모든 벡터는 길이가 같아야하고, 문자변수들은 무조건  factor(범주형 변수)로 가져온다. 

      Died.At <- c(22, 40, 72, 41)
    Writer.At <- c(16, 18, 36, 36)
   First.Name <- c("John", "Edgar", "Walt", "Jane")
  Second.Name <- c("Doe", "Poe", "Whitman", "Austen")
          Sex <- c("MALE", "MALE", "MALE", "FEMALE")
Date.Of.Death <- c("2015-05-10", "1849-10-07", "1892-03-26", "1817-07-18")

writers_df <- data.frame(Died.At, Writer.At, First.Name, Second.Name, Sex, Date.Of.Death)

str(writers_df)
head(writers_df); tail(writers_df)
'data.frame':	4 obs. of  6 variables:
 $ Died.At      : num  22 40 72 41
 $ Writer.At    : num  16 18 36 36
 $ First.Name   : Factor w/ 4 levels "Edgar","Jane",..: 3 1 4 2
 $ Second.Name  : Factor w/ 4 levels "Austen","Doe",..: 2 3 4 1
 $ Sex          : Factor w/ 2 levels "FEMALE","MALE": 2 2 2 1
 $ Date.Of.Death: Factor w/ 4 levels "1817-07-18","1849-10-07",..: 4 2 3 1

 

  •  I()  insulate
  • as.Date() 
      Died.At <- c(22, 40, 72, 41)
    Writer.At <- c(16, 18, 36, 36)
#  First.Name <- c("John", "Edgar", "Walt", "Jane")
   First.Name <- I(c("John", "Edgar", "Walt", "Jane"))
# Second.Name <- c("Doe", "Poe", "Whitman", "Austen")
  Second.Name <- I(c("Doe", "Poe", "Whitman", "Austen"))
          Sex <- c("MALE", "MALE", "MALE", "FEMALE")
#Date.Of.Death <- c("2015-05-10", "1849-10-07", "1892-03-26", "1817-07-18")
Date.Of.Death <- as.Date(c("2015-05-10", "1849-10-07", "1892-03-26","1817-07-18"))

writers_df <- data.frame(Died.At, Writer.At, First.Name, Second.Name, Sex, Date.Of.Death)
str(writers_df)
'data.frame':	4 obs. of  6 variables:
 $ Died.At      : num  22 40 72 41
 $ Writer.At    : num  16 18 36 36
 $ First.Name   :Class 'AsIs'  chr [1:4] "John" "Edgar" "Walt" "Jane"
 $ Second.Name  :Class 'AsIs'  chr [1:4] "Doe" "Poe" "Whitman" "Austen"
 $ Sex          : Factor w/ 2 levels "FEMALE","MALE": 2 2 2 1
 $ Date.Of.Death: Date, format: "2015-05-10" "1849-10-07" ...

 

DataFrame기초

DataFrame의 행과 열 이름 바꾸기 

위 dataFrame을 활용해 보자

writers_df
print(writers_df)
  Died.At Writer.At First.Name Second.Name    Sex Date.Of.Death
1      22        16       John         Doe   MALE    2015-05-10
2      40        18      Edgar         Poe   MALE    1849-10-07
3      72        36       Walt     Whitman   MALE    1892-03-26
4      41        36       Jane      Austen FEMALE    1817-07-18

DataFrame은 헤더에 해당하는 names 속성을 가진다. 
names() 함수로 헤더를 확인하고 수정이 가능하다.

names(writers_df)
[1] "Died.At"   "Writer.At"  "First.Name"  "Second.Name" "Sex"    "Date.Of.Death"

수정

 

names(writers_df) <- c("Age.At.Death", "Age.As.Writer", "Name", "Surname", "Gender", "Death")
names(writers_df)
[1] "Age.At.Death"  "Age.As.Writer" "Name"  "Surname"    "Gender"   "Death"

colnames() and rownames()함수로 행과 열의 이름을 수정할수 있다. 

colnames(writers_df) = c("Age.At.Death", "Age.As.Writer", "Name", "Surname", "Gender", "Death")
rownames(writers_df) = c("ID1", "ID2", "ID3", "ID4")
    Age.At.Death Age.As.Writer  Name Surname Gender      Death
ID1           22            16  John     Doe   MALE 2015-05-10
ID2           40            18 Edgar     Poe   MALE 1849-10-07
ID3           72            36  Walt Whitman   MALE 1892-03-26
ID4           41            36  Jane  Austen FEMALE 1817-07-18

 

Data Frame의 Dimensions

dim() function

dim(writers_df)
## [1] 4 6

DataFrame의 matrix와 비슷하니..

dim(writers_df)[1] #Number of rows
dim(writers_df)[2] #Number of columns

nrow(writers_df) 
ncol(writers_df)
[1] 4
[1] 6

 

DataFrame의 list의 특별한 형태이니… length() 함수로 열갯수

length(writers_df) 
[1] 6

 

 Data Frame의 Values을 접근/수정

… Through The Variable Names

1) $ 로  값에 접근 

writers_df$Age.As.Writer

 

[1] 16 18 36 36

2) [,] 으로 값에 접근 

writers_df [1:2,3] 
## [1] "John"  "Edgar"

수정은 위 두가지 방법을 사용해 범위를 정하게 Assign함. 

writers_df$Age.At.Death <- writers_df$Age.At.Death-1
writers_df[,1] <- writers_df$Age.At.Death-1
writers_df[1,3] = "Jane"

 

Data Frames을 attach하는 방법

attach() /detach()함수를 사용하면, $를 사용하면서 매번 dataframe이름을 쓰는 번거로움을 피할수 있다. 

search()
 [1] ".GlobalEnv"        "tools:rstudio"     "package:stats"     "package:graphics"  "package:grDevices"
 [6] "package:utils"     "package:datasets"  "package:methods"   "Autoloads"         "package:base" 
attach(writers_df)
> search()
 [1] ".GlobalEnv"        "writers_df"        "tools:rstudio"     "package:stats"     "package:graphics" 
 [6] "package:grDevices" "package:utils"     "package:datasets"  "package:methods"   "Autoloads"        
[11] "package:base"  

attach() 함수 대신에 with() 함수를 사용할수도 있다. 

with(writers_df, c("Age.At.Death", "Age.As.Writer", "Name", "Surname", "Gender", "Death"))
[1] "Age.At.Death"  "Age.As.Writer" "Name"          "Surname"      "Gender"        "Death"

If you get an error that tells you that “The following objects are masked by .GlobalEnv:”, this is because you have objects in your global environment that have the same name as your data frame. Those objects could be the vectors that you created above, if you didn’t change their names. You have two solutions to this:

  1. You just don’t create any objects with those names in your global environment. This is more a solution for those of you who imported their data through read.table(), read.csv() or read.delim(), but not really appropriate for this case.
  2. You rename the objects in the data frame so that there’s no conflict. This is the solution that was applied in this tutorial. So, rename your columns with the names() or colnames() functions.

Note that if all else fails, you can just remember to always refer to your data frame’s column names with the $ notation!

Data Frames에 함수를 apply하는 방법

Intermediate R course

하나의 dataframe에서 다른 dataframe뽑아내기

Ages <- writers_df[,1:2]

apply() 함수를 이용하여, 평균값을 구해보자. (dataframe, 1행/2열, 구하고자 하는 함수)

apply(Ages, 2, median)
Age.At.Death Age.As.Writer 
        40.5          27.0
apply(Ages,1,median)
 ID1  ID2  ID3  ID4 
19.0 29.0 54.0 38.5

 

 

추가 

 

 

How To Extract Rows And Colums, Subseting Your Data Frame

Subsetting or extracting specific rows and columns from a data frame is an important skill in order to surpass the basics that have been introduced in step two, because it allows you to easily manipulate smaller sets of your original data frame. You basically extract those values from the rows and columns that you need in order to optimize the data analyses you make.

It’s easy to start subsetting with the [,] notation that was described in step two:

writer_names_df <- writers_df[1:4, 3:4]
writer_names_df
##    Name Surname
## 1  Jane     Doe
## 2 Edgar     Poe
## 3  Walt Whitman
## 4  Jane  Austen

Note that you can also define this subset with the variable names:

writer_names_df <- writers_df[1:4, c("Name", "Surname")]

Tip: be careful when you are subsetting just one column!

R has the tendency to simplify your results, which means that it will read your subset as a vector,
which normally, you don’t want to get.
To make sure that this doesn’t happen, you can add the argument drop=FALSE:

writer_names_df <- writers_df[1:4, "Name", drop=FALSE]
str(writer_names_df)
## 'data.frame':    4 obs. of  1 variable:
##  $ Name:Class 'AsIs'  chr [1:4] "Jane" "Edgar" "Walt" "Jane"

In a next step, you can try subsetting with the subset() function:

writer_names_df <- subset(writers_df, Age.At.Death <= 40 & Age.As.Writer >= 18)
writer_names_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 2           40            18 Edgar     Poe   MALE 1849-10-07

You can also subset on a particular value:

writer_names_df <- subset(writers_df, Name =="Jane")
writer_names_df
##   Age.At.Death Age.As.Writer Name Surname Gender      Death
## 1           22            16 Jane     Doe FEMALE 2015-05-10
## 4           41            36 Jane  Austen FEMALE 1817-07-18

You can not only subset with the R functions that have been described above. You can also turn to grep() to get the job done. For example, if you want to work with the rows in the column Age.At.Death that have values that contain “4”, you can use the following line of code:

fourty_writers <- writers_df[grep("4", writers_df$Age.At.Death),]
fourty_writers
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 4           41            36  Jane  Austen FEMALE 1817-07-18

Note that by subsetting, you basically stop considering certain values of your data frame. This might mean that you remove certain features of a factor, by, for example, only considering the MALE members of the writers_df data frame. Notice how all factor levels of this column still remain present, even though you have created a subset:

male_writers <- writers_df[Gender =="MALE",]
str(male_writers)
## 'data.frame':    0 obs. of  6 variables:
##  $ Age.At.Death : num 
##  $ Age.As.Writer: num 
##  $ Name         :Class 'AsIs'  chr(0) 
##  $ Surname      :Class 'AsIs'  chr(0) 
##  $ Gender       : Factor w/ 2 levels "FEMALE","MALE": 
##  $ Death        :Class 'Date'  num(0)

To remove the factor levels that are no longer present, you can enter the following line of code:

factor(Gender)
## factor(0)
## Levels:

How To Remove Columns And Rows From A Data Frame

If you want to remove values or entire columns from your data frame, you can assign a NULL value to the desired unit:

writers_df[1,3] <- NULL
Age.At.Death <- NULL

To remove rows, the procedure is a bit more complicated. You define a new vector in which you list for every row whether to have it included or not. Then, you apply this vector to your data frame:

rows_to_keep <- c(TRUE, FALSE, TRUE, FALSE)
limited_writers_df <- writers_df[rows_to_keep,]
limited_writers_df
##   Age.At.Death Age.As.Writer Name Surname Gender      Death
## 1           22            16 Jane     Doe FEMALE 2015-05-10
## 3           72            36 Walt Whitman   MALE 1892-03-26

Note that you can also do the opposite by just adding !, stating that the reverse is true:

less_writers_df <- writers_df[!rows_to_keep,]
less_writers_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 4           41            36  Jane  Austen FEMALE 1817-07-18

You can also work with thresholds. For example, you can specify that you only want to keep all writers that were older than fourty when they died:

fourty_sth_writers <- writers_df[writers_df$Age.At.Death > 40,]
fourty_sth_writers
##   Age.At.Death Age.As.Writer Name Surname Gender      Death
## 3           72            36 Walt Whitman   MALE 1892-03-26
## 4           41            36 Jane  Austen FEMALE 1817-07-18

How To Add Rows And Columns To A Data Frame

Much in the same way that you used the [,] and $ notations to access and change single values of your data frame, you can also easily add columns to your data frame:

writers_df$Location <- c("Belgium", "United Kingdom", "United States", "United Kingdom")

Appending rows to an existing data frame is somewhat more complicated. To easily do this by first making a new row in a vector, respecting the column variables that have been defined in writers_df and by then binding this row to the original data frame with the rbind() function:

new_row <- c(50, 22, "Roberto", "Bolano", "MALE", "2003-07-15")
writers_df_large <- rbind(writers_df, new_row)

Why And How To Reshape A Data Frame From Wide To Long Format And Vice Versa

When you have multiple values, spread out over multiple columns, for the same instance, your data is in the “wide” format. On the other hand, when your data is in the “long” format if there is one observation row per variable. You therefore have multiple rows per instance. Let’s illustrate this with an example. Long data looks like this:

Subject <- c(1,2,1,2,2,1)
Gender <- c("M", "F", "M", "F", "F","M")
Test <- c("Read", "Write", "Write", "Listen", "Read", "Listen")
Result <- c(10, 4, 8, 6, 7, 7)
observations_long <- data.frame(Subject, Gender, Test, Result)
observations_long
##   Subject Gender   Test Result
## 1       1      M   Read     10
## 2       2      F  Write      4
## 3       1      M  Write      8
## 4       2      F Listen      6
## 5       2      F   Read      7
## 6       1      M Listen      7

As you can see, there is one row for each value that you have in the Test variable. A lot of statistical tests favor this format.

This data frame would look like the following in the wide format:

Subject <- c(1,2)
Gender <- c("M", "F")
Read <- c(10, 7)
Write <-c(8, 4)
Listen <- c(7, 6)
observations_wide <- data.frame(Subject, Gender, Read, Write, Listen)
observations_wide
##   Subject Gender Read Write Listen
## 1       1      M   10     8      7
## 2       2      F    7     4      6

You see that each column represents a unique pairing of the various factors with the values.

Since different functions may require you to input your data either in “long” or “wide” format, you might need to reshape your data set. There are two main options that you can choose here: you can use the stack() function or you can try using the reshape() function. The former is preferred when you work with simple data frames, while the latter is more often used on more complex data frames, mostly because there’s a difference in the possibilities that both functions offer.

Make sure to keep on reading to know more about the differences in possibilities between the stack() and reshape() functions!

Using stack() For Simply Structured Data Frames

The stack() function basically concatenates or combines multiple vectors into a single vector, along with a factor that indicates where each observation originates from.

To go from wide to long format, you will have to stack your observations, since you want one observation row per variable, with multiple rows per variable. In this case, you want to merge the columns Read, Write and Listen together, qua names and qua values:

long_format <- stack(observations_wide, 
                     select=c(Read, 
                              Write, 
                              Listen))
long_format
##   values    ind
## 1     10   Read
## 2      7   Read
## 3      8  Write
## 4      4  Write
## 5      7 Listen
## 6      6 Listen

To go from long to wide format, you will need to unstack your data, which makes sense because you want to have one row per instance with each value present as a different variable. Note here that you want to disentangle the Result and Test columns:

wide_format <- unstack(observations_long, 
                       Result ~ Test)
wide_format
##   Listen Read Write
## 1      6   10     4
## 2      7    7     8

Using reshape() For Complex Data Frames

This function is part of the stats package. This function is similar to the stack() function, but is a little bit more elaborate. Read and see for yourself how reshaping your data works with the reshape() function:

To go from a wide to a long data format, you can first start off by entering the reshape() function. The first argument should always be your original wide data set. In this case, you can specify that you want to input the observations_wide to be converted to a long data format.

Then, you start adding other arguments to the reshape() function:

  1. Include a list of variable names that define the different measurements through varying. In this case, you store the scores of specific tests in the columns “Read”, “Write” and “Listen”.
  2. Next, add the argumentv.names to specify the name that you want to give to the variable that contains these values in your long dataset. In this case, you want to combine all scores for all reading, writing and listening tests into one variable Score.
  3. You also need to give a name to the variable that describes the different measurements that are inputted with the argument timevar. In this case, you want to give a name to the column that contains the types of tests that you give to your students. That’s why this column’s name should be called “Test”.
  4. Then, you add the argument times, because you need to specify that the new column “Test” can only take three values, namely, the test components that you have stored: “Read”, “Write”, “Listen”.
  5. You’re finally there! Give in the end format for the data with the argument direction.
  6. Additionally, you can specify new row names with the argument new.row.names.

Tip: try leaving out this last argument and see what happens!

library(stats)
long_reshape <- reshape(observations_wide, 
             varying = c("Read", "Write", "Listen"), 
             v.names = "Score",
             timevar = "Test", 
             times = c("Read", "Write", "Listen"),
             direction = "long",
             new.row.names = 1:1000)
long_reshape
##   Subject Gender   Test Score id
## 1       1      M   Read    10  1
## 2       2      F   Read     7  2
## 3       1      M  Write     8  1
## 4       2      F  Write     4  2
## 5       1      M Listen     7  1
## 6       2      F Listen     6  2

From long to wide, you take sort of the same steps. First, you take the reshape() function and give it its first argument, which is the data set that you want to reshape. The other arguments are as follows:

  1. timevar allows you to specify that the variable Test, which describes the different tests that you give to your students, should be decomposed.
  2. You also specify that the reshape() function shouldn’t take into account the variables Subject and Gender of the original data set. You put these column names into idvar.
  3. By not naming the variable Result, the reshape() function will know that both Test and Result should be recombined.
  4. You specify the direction of the reshaping, which is in this case, wide!
wide_reshape <- reshape(observations_long, 
                        timevar = "Test",
                        idvar = c("Subject", "Gender"),
                        direction = "wide")
wide_reshape
##   Subject Gender Result.Read Result.Write Result.Listen
## 1       1      M          10            8             7
## 2       2      F           7            4             6

Note that if you want you can also rename or sort the results of these new long and wide data formats! You can find detailed instructions below.

Reshaping Data Frames With tidyr

This package allows you to “easily tidy data with the spread() and gather() functions” and that’s exactly what you’re going to do if you use this package to reshape your data!

If you want to convert from wide to long format, the principle stays similar to the one that of reshape(): you use the gather() function and you start specifying its arguments:
1. Your data set is the first argument to the gather() function
2. Then, you specify the name of the column in which you will combine the the values of Read, Write and Listen. In this case, you want to call it something like Test or Test.Type.
3. You enter the name of the column in which all the values of the Read, Write and Listen columns are listed.
4. You indicate which columns are supposed to be combined into one. In this case, that will be the columns from Read, to Listen.

library(tidyr)
long_tidyr <- gather(observations_wide, 
                     Test, 
                     Result, 
                     Read:Listen)
long_tidyr
##     Subject Gender   Test Result
## 1       1      M   Read     10
## 2       2      F   Read      7
## 3       1      M  Write      8
## 4       2      F  Write      4
## 5       1      M Listen      7
## 6       2      F Listen      6

Note how this the last argument specifies the columns in the same way as you did to subset your data frame or to select your data frame’s columns in which you wanted to perform mathematical operations. You can also just specify the columns individually like this:

long_tidyr <- gather(observations_wide, 
                     Test, 
                     Result, 
                     Read, 
                     Write, 
                     Listen)

The opposite direction, from long to wide format, is very similar to the function above, but this time with the spread() function:

library(tidyr)
wide_tidyr <- spread(observations_long, 
                     Test, 
                     Result)
wide_tidyr
##    Subject Gender Listen Read Write
## 1       1      M      7   10     8
## 2       2      F      6    7     4

Again, you take as the first argument your data set. Then, you specify the column that contains the new column names. In this case, that is Test. Lastly, you input the name of the column that contains the values that should be put into the new columns.

Tip: take a look at the “Data Wrangling With dplyr And tidyr Cheat Sheet” for a complete overview of the possibilities that these packages can offer you to wrangle your data!

Reshaping Data Frames With reshape2

This package, which allows you to “flexibly reshape data”, actually has very straightforward ways of reshaping your data frame.

To go from a wide to a long data format, you use the melt() function. This function is pretty easy, since it just takes your data set and the id.vars argument, which you may already know from the reshape() function. This argument allows you to specify which columns should be left alone by the function.

library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
long_reshaped2 <- melt(observations_wide, 
                       id.vars=c("Subject", "Gender"))
long_reshaped2
##   Subject Gender variable value
## 1       1      M     Read    10
## 2       2      F     Read     7
## 3       1      M    Write     8
## 4       2      F    Write     4
## 5       1      M   Listen     7
## 6       2      F   Listen     6

Note that this function allows you to specify a couple more arguments:

library(reshape2)
long_reshaped2 <- melt(observations_wide, 
                       id.vars=c("Subject", "Gender"),
                       measure.vars=c("Read", "Write", "Listen"),
                       variable.name="Test",
                       value.name="Result")
long_reshaped2
##   Subject Gender   Test Result
## 1       1      M   Read     10
## 2       2      F   Read      7
## 3       1      M  Write      8
## 4       2      F  Write      4
## 5       1      M Listen      7
## 6       2      F Listen      6
  • measure.vars is there to name the destination column that will combine the original columns. If you leave out this argument, the melt() function will use all other variables as the id.vars.
  • variable.name specifies how you want to name that destination column. If you don’t specify this argument, you will have a column named “variable” in your result.
  • value.name allows you to input the name of the column in which the values or test results will be stored. If you leave out this argument, this column will be named “measurement”.

You can also go from a long to a wide format with the reshape2 package with the dcast() function. This is fairly easy: you first give in your data set, as always. Then, you combine the columns which you don’t want to be touched; In this case, you want to keep Subject and Gender as they are. The column Test however, you want to split! So, that is the second part of your second argument, indicated by a ~. The last argument of this function is value.var, which holds the values of the different tests. You want to name this new column Result:

library(reshape2)
long_reshaped2 <- dcast(observations_long, 
                        Subject + Gender ~ Test, 
                        value.var="Result")
long_reshaped2
##   Subject Gender Listen Read Write
## 1       1      M      7   10     8
## 2       2      F      6    7     4

How To Sort A Data Frame

Sorting a data frame by columns might seem tricky, but this can be made easy by either using R’s built-in order() function or by using a package.

R’s Built-In Order() Function

You can for example sort by one of the dataframe’s columns. You order the rows of the data frame according to the values that are stored in the variable Age.As.Writer:

writers_df[order(Age.As.Writer),]
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           22            16  Jane     Doe FEMALE 2015-05-10
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 3           72            36  Walt Whitman   MALE 1892-03-26
## 4           41            36  Jane  Austen FEMALE 1817-07-18

If you want to sort the values starting from high to low, you can just add the extra argument decreasing, which can only take logical values.

Remember that logical values are TRUE or FALSE, respectively.

Another way is to add the function rev() so that it includes the order() function. As the function’s name suggests, it provides a way to give you the reversed version of its argument, which is order(Name) in this case:

writers_df[order(Age.As.Writer, decreasing=TRUE),]
writers_df[rev(order(Age.As.Writer)),]

You can also add a – in front of the numeric variable that you have given to order on.

writers_df[order(-Age.As.Writer),]
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 3           72            36  Walt Whitman   MALE 1892-03-26
## 4           41            36  Jane  Austen FEMALE 1817-07-18
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 1           22            16  Jane     Doe FEMALE 2015-05-10

Sorting With dplyr

The dplyr package, known for its abilities to manipulate data, has a specific function that allows you to sort rows by variables.

Dplyr’s function to make this happen is arrange(). The first argument of this function is the data set that you want to sort, while the second and third arguments are the variables that you choose to sort. In this case we sort first on the variable Age.At.Death and then on Age.As.Writer.

data2 <- arrange(writers_df, Age.At.Death, Age.As.Writer)
data2
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           22            16  Jane     Doe FEMALE 2015-05-10
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 3           41            36  Jane  Austen FEMALE 1817-07-18
## 4           72            36  Walt Whitman   MALE 1892-03-26

You can also use the following approach to get the same result:

writers_df[with(writers_df, order(Age.At.Death, Age.As.Writer)), ]

If you want to sort these columns in descending order, you can add the function desc() to the variables:

desc_sorted_data <- arrange(writers_df, desc(Age.At.Death))

Interested in doing much more with the dplyr package? Check out our Data Manipulation in R with dplyr course, which will teach you how to to perform sophisticated data manipulation tasks using dplyr! Also, don’t forget to look at the “Data Wrangling With dplyr And tidyr Cheat Sheet”!

How To Merge Data Frames

Merging Data Frames On Column Names

You can use the merge() function to join two, but only two, data frames. Let’s say we have a data frame data2, which has the same values stored in a variable Age.At.Death, which we also find in writers_df, with exactly the same values. You thus want to merge the two data frames on the basis of this variable:

data2 <- data.frame(Age.At.Death=c(22,40,72,41), Location=5:8)

We can easily merge these two:

new_writers_df <- merge(writers_df, data2)
new_writers_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death Location
## 1           22            16  Jane     Doe FEMALE 2015-05-10        5
## 2           40            18 Edgar     Poe   MALE 1849-10-07        6
## 3           41            36  Jane  Austen FEMALE 1817-07-18        8
## 4           72            36  Walt Whitman   MALE 1892-03-26        7

Tip: check what happens if you change the order of the two arguments of the merge() function!

This way of merging is equivalent to an outer join in SQL.

Unfortunately, you’re not always this lucky with your data frames. In many cases, some of the columns names or variable values will differ, which makes it hard to follow the easy, standard procedure that was described just now. In addition, you may not always want to merge in the standard way that was described above. In the following, some of the most common issues are listed and solved!

What If… (Some Of) The Data Frame’s Column Values Are Different?

If (some of) the values of the variable on which you merge differ in the data frames, you have a small problem, because the merge() function supposes that these are the same so that any new variables that are present in the second data frame can be added correctly to the first data frame. Consider the following data frame:

data2 <- data.frame(x.Age.At.Death=c(21,39,71,40), Location=5:8)

You see that the values for the attribute Age.At.Death do not fit with the ones that were defined for the writers_df data frame.

No worries, the merge() function provides extra arguments to solve this problem. The argument all.x allows you to specify to add the extra rows of the Location variable to the resulting data frame, even though this column is not present in writers_df. In this case, the values of the Location variable will be added to the writers_df data frame for those rows of which the values of the Age.At.Death attribute correspond. All rows where the Age.At.Death of the two data frames don’t correspond, will be filled up with NA values.
Note that this join corresponds to a left outer join in SQL and that the default value of the all.x argument is FALSE, which means that one normally only takes into account the corresponding values of the merging variable. Compare with:

merge(writers_df, data2, all.x=FALSE)

You can also specify the argument all.y=TRUE if you want to add extra rows for each row that data2 has no matching row in writers_df:

merge(writers_df, data2, all.y=TRUE)

Note that this type of join corresponds to a right outer join in SQL.

What If… Both Data Frames Have The Same Column Names?

What if your two data frames have exactly the same two variables, with or without the same values?

data2 <- data.frame(Age.At.Death=c(21,39,71,40), Age.As.Writer=c(11,25,36,28))

You can chose to keep all values from all corresponding variables and to add rows to the resulting data frame:

merge(writers_df, data2, all=TRUE)
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           21            11                
## 2           22            16  Jane     Doe FEMALE 2015-05-10
## 3           39            25                
## 4           40            18 Edgar     Poe   MALE 1849-10-07
## 5           40            28                
## 6           41            36  Jane  Austen FEMALE 1817-07-18
## 7           71            36                
## 8           72            36  Walt Whitman   MALE 1892-03-26

Or you can just chose to add values from one specific variable for which the ages of death correspond:

merge(writers_df, data2, by="Age.At.Death")
##   Age.At.Death Age.As.Writer.x  Name Surname Gender      Death
## 1           40              18 Edgar     Poe   MALE 1849-10-07
##   Age.As.Writer.y
## 1              28

What If… The Data Frames’ Column Names Are Different?

Lastly, what if the variable’s names on which you merge differ in the two data frames?

data2 <- data.frame(Age=c(22,40,72,41), Location=5:8)

You just specify in the merge() function that there are two other specifications through the arguments by.x and by.y.

merge(writers_df, data2, by.x="Age.At.Death", by.y="Age")
#   Age.At.Death Age.As.Writer  Name Surname Gender      Death Location
## 1           22            16  Jane     Doe FEMALE 2015-05-10        5
## 2           40            18 Edgar     Poe   MALE 1849-10-07        6
## 3           41            36  Jane  Austen FEMALE 1817-07-18        8
## 4           72            36  Walt Whitman   MALE 1892-03-26        7

Merging Data Frames On Row Names

You can indeed merge the columns of two data frames, that contain a distinct set of columns but some rows with the same names. The merge() function and its arguments come to the rescue!

Consider this second data frame:

Address <- c("50 West 10th", "77 St. Marks Place", "778 Park Avenue")
Maried <- c("YES", "NO", "YES")
limited_writers_df <- data.frame(Address, Maried)
limited_writers_df
##              Address Maried
## 1       50 West 10th    YES
## 2 77 St. Marks Place     NO
## 3    778 Park Avenue    YES

You see that this data set contains three rows, marked with numbers 1 to 3, and two additional columns that are not in the writers_df data frame. To merge these two data frames, we add the argument by to the merge() function and set it at the number 0, which specifies the row names. Since you choose to keep all values from all corresponding variables and to add columns to the resulting data frame, you set the all argument to TRUE:

writers_row_sorted <- merge(writers_df, limited_writers_df, by=0, all=TRUE)
writers_row_sorted
##   Row.names Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1         1           22            16  Jane     Doe FEMALE 2015-05-10
## 2         2           40            18 Edgar     Poe   MALE 1849-10-07
## 3         3           72            36  Walt Whitman   MALE 1892-03-26
## 4         4           41            36  Jane  Austen FEMALE 1817-07-18
##              Address Maried
## 1       50 West 10th    YES
## 2 77 St. Marks Place     NO
## 3    778 Park Avenue    YES
## 4                  

It could be that the fields for rows that don’t occur in both data frames result in NA-values. You can easily solve this by removing them. This will be discussed below.

How To Remove Data Frames’ Rows And Columns With NA-Values

To remove all rows that contain NA-values, one of the easiest options is to use the na.omit() function, which takes your data frame as an argument. Let’s recycle the code from the previous section in which two data frames were merged, with a lot of resulting NA-values:

data2 <- data.frame(Age.At.Death=c(21,39,71,40), Location=5:8)
merge <- merge(writers_df, data2, all.y=TRUE)
na.omit(merge)
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death Location
## 3           40            18 Edgar     Poe   MALE 1849-10-07        8

If you just want to select part of your data frame from which you want to remove the NA-values, it’s better to use complete.cases(). In this case, you’re interested to keep all rows for which the values of the columns Age.As.Writer and Name are complete:

data2 <- data.frame(Age.At.Death=c(21,39,71,40), Location=5:8)
merge <- merge(writers_df, data2, all.y=TRUE)
merge[complete.cases(merge[,2:3]),]
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death Location
## 3           40            18 Edgar     Poe   MALE 1849-10-07        8

How To Convert Lists Or Matrices To Data Frames And Back

From Lists or Matrices To Data Frames

Lists or matrices that comply with the restrictions that the data frame imposes can be coerced into data frames with the as.data.frame() function. Remember that a data frame is similar to the structure of a matrix, where the columns can be of different types. Data frames are also similar to lists, where each column is an element of the list and each element has the same length. Any matrices or lists that you want to convert to data frames need to satisfy with these restrictions.

For example, the matrix A can be converted to a data frame because each column contains values of the numeric data type:

A = matrix(c(2, 4, 3, 1, 5, 7), nrow=2, ncol=3, byrow = TRUE) 
A
##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7

You enter the matrix A as an argument to the as.data.frame() function:

A_df <- as.data.frame(A)
A_df
##   V1 V2 V3
## 1  2  4  3
## 2  1  5  7

You can follow the same procedures for lists like the one that is shown below:

n = c(2, 3, 5) 
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
x = list(n, s, b, 3)
x_df <- as.data.frame(x)

Data Frame을 Matrix 또는 List로 바꾸기

To make the opposite move, that is, to convert data frames to matrices and lists, you first have to check for yourself if this is possible.
Does you data frame contain one or more dimensions and what about the amount of data types?
Rewatch the small animation of the introduction if you’re not sure what data structure to pick.

Once you have an answer, you can use the functions as.matrix() and as.list() to convert your data frame to a matrix or a list, respectively:

writers_matrix <- as.matrix(writers_df)
writers_matrix
##      Age.At.Death Age.As.Writer Name    Surname   Gender   Death       
## [1,] "22"         "16"          "Jane"  "Doe"     "FEMALE" "2015-05-10"
## [2,] "40"         "18"          "Edgar" "Poe"     "MALE"   "1849-10-07"
## [3,] "72"         "36"          "Walt"  "Whitman" "MALE"   "1892-03-26"
## [4,] "41"         "36"          "Jane"  "Austen"  "FEMALE" "1817-07-18"
writers_list <- as.list(writers_df)
writers_list
## $Age.At.Death
## [1] 22 40 72 41
## 
## $Age.As.Writer
## [1] 16 18 36 36
## 
## $Name
## [1] "Jane"  "Edgar" "Walt"  "Jane" 
## 
## $Surname
## [1] "Doe"     "Poe"     "Whitman" "Austen" 
## 
## $Gender
## [1] FEMALE MALE   MALE   FEMALE
## Levels: FEMALE MALE
## 
## $Death
## [1] "2015-05-10" "1849-10-07" "1892-03-26" "1817-07-18"

For those of you who want to specifically make numeric matrices, you can use the function data.matrix() or add an sapply() function to the as.matrix() function:

writers_matrix <- data.matrix(writers_df)
writers_matrix <- as.matrix(sapply(writers_df, as.numeric))

Note that with the current writers_df data frame, which contains a mixture of data types, NA-values will be introduced in the resulting matrices.

 

 

 

Categories: Programming

onesixx

Blog Owner

Leave a Reply

Your email address will not be published.