Contributing to the Tidyverse (dbplyr)
I was acknowledged as a contributor to the version 2.0.0 release of dbplyr!
dbplyr is the database backend for the ‘data pliers’ dplyr data manipulation package in the tidyverse software suite of the R statistical programming language.
Or, to describe it from the ‘top down’:
R is a computer programming language used by statisticians and others who want to interpret data.
tidyverse is a collection of software packages for the R language which makes it easier for R users to manipulate and process data. So much easier that R is now taught to liberal arts post-graduate students to analyze data, e.g. for environmental studies at Harvard Extension School. These students often have no prior experience in computer programming.
The tidyverse was largely the creation of a New Zealander, Hadley Wickham, and it looks like he is the chief maintainer of the tidyverse software. Like R, tidyverse is ‘open source’, freely available for use and modification, and contributed to by many enthusiasts in the data science community.
dplyr is a software package in the tidyverse collection which does many of the common data manipulation tasks, such as filtering, changing, sorting, summarizing and selection.
dbplyr allows dplyr to interact with database backends.
My contributions to the free and open-source dbplyr are (ironically) related to dbplyr operation with Microsoft SQL Server ‘MSSQL’.
To give Microsoft credit, the basic versions of Microsoft SQL Server are freely
available, as are client libraries (for use in Linux), and Microsoft also provides
extensive freely available documentation.
As of 21st December 2020, my two accepted contributions (‘pull requests’) are:
Cast as.double and as.numeric to FLOAT instead of NUMERIC
In MSSQL, a cast to NUMERIC with no precision or scale specified defaults to a scale of zero, so floating point numbers are converted to whole numbers, which is not what is intended for as.double and as.numeric in R.
Use try_cast instead of cast for MSSQL version 11+ (2012+)
In MSSQL, try_cast allows more elegant handling of invalid entries. try_cast returns NULL, which R sees as NA (not available), in situations where cast will return an error because a value cannot be converted.
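Both changes can be looked at without a real database. The following is only a minimal sketch, assuming dbplyr 2.0.0 or later is installed: simulate_mssql() and translate_sql() are dbplyr functions, while the column name x and the two version strings are made up for illustration.

```r
library(dbplyr)

# Simulated SQL Server connections, so no real database is needed.
# The version strings are illustrative: "11.0" stands in for SQL Server 2012,
# "10.0" for an older server.
con_new <- simulate_mssql(version = "11.0")
con_old <- simulate_mssql(version = "10.0")

# `x` is a hypothetical column name. translate_sql() only builds SQL text,
# it does not run anything.
sql_new <- translate_sql(as.numeric(x), con = con_new)
sql_old <- translate_sql(as.numeric(x), con = con_old)

# Print both translations to compare them side by side.
print(sql_new)
print(sql_old)
```

On the newer simulated server the translation should use TRY_CAST with FLOAT, and on the older one it should fall back to a plain CAST. The exact quoting of the column name may differ between dbplyr versions.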
As of 21st December 2020, I also have an open contribution (‘pull request’) that fixes an error in my second contribution.
What I really would like to say is just how friendly Hadley Wickham and others
have been in helping me contribute to and improve dbplyr.
Both in initial discussion and in the process of doing a ‘pull request’, Hadley and Kirill Müller have answered the simplest of queries, amended my super-clumsy code and really encouraged me along! Hadley is an adjunct professor and something of a data science legend. I have not attended a formal computer programming class at high school, university or trade school, so I’m really humbled to feel like a valued contributor to the data science world.
(And why am I so interested in improving the operation of dbplyr with MSSQL?
It is because I use dbplyr/dplyr to interrogate the Best Practice
electronic medical record patient information database with my ‘near future’
patient care quality improvement tool GPstat!.)