viernes, 20 de abril de 2012

Sanitizing data in SAP HANA with R


From April 10 to April 11, my team (Anne, Juergen and myself) host an InnoJam in Boston. It was a really great event, but the data provided by the City of Boston wasn't exactly in the best shape, so we took a lot of efforts (with a help of the SAP Guru's that helped us) to sanitize the data.

At some point, as I was asked to use my Regular Expressions skills (everybody knows I'm crazy about RegEx) to sanitize some data that came into a really weird format inside a field...something like this (Can't post the real data for obvious reasons):

Type: [This is a test] Area: [My Area] Description: [This data is not right]

What I need to do was basically, took each record and generate a new table with those 3 fields and end with something like this...

TypeAreaDescription
This is a testMy AreaThis data is not right


I began to think on which language could be the best for the job...I thought on ABAP, Python...and then of course...R...so I choose R.

The problems that immediately arise were simple, how to pull data from HANA and how to send it back, also, even when I believe myself a RegEx Hero, R uses a not very standard RegEx schema. I thought on downloading the data from HANA as .CSV, clean it up and then upload the .CSV back to HANA...but then I thought that the extra work was not worth it...so then...my good old friend RODBC came into the show...even when it's not supported by SAP, I decided that for this particular case, it would be just fine...I could read the data back and forth and have everything back into HANA in a very fast way.

Let's create a table and call it BAD_DATA...and just create 10 dummy records (I know...I'm lazy)...


So this is the script:

library("RODBC")
ch<-odbcConnect("HANA_SERVER",uid="P075400",pwd="***")
query<-("select case_description from P075400.BAD_DATA")
CRM_TAB<-sqlQuery(ch,query)
 
SR_Type<-c()
SR_Area<-c()
Description<-c()
 
for(i in 1:nrow(CRM_TAB)){
  mypattern = '\\[([^\\[]*)\\]'
  datalines = grep(mypattern,CRM_TAB$CASE_DESCRIPTION[i],value=T)
  getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
  g_list = gregexpr(mypattern,datalines)
  matches = mapply(getexpr,datalines,g_list)
  result = gsub(mypattern,'\\1',matches)
  var<-0
  i<-0
  for(i in 1:length(result)){
    var<-var+1 
    if(var==4){
      break
    } 
    if(var==1){
      SR_Type<-append(SR_Type,result[i])
    }
    if(var==2){
      SR_Area<-append(SR_Area,result[i])
    }
    if(var==3){   
      Description<-append(Description,result[i])
    }
  }
  if(length(SR_Type)>length(Description)){
    Description<-append(Description,NA)
  }
  if(length(SR_Type)>length(SR_Area)){
    SR_Area<-append(SR_Area,NA)
  } 
}
 
GOOD_DATA<-data.frame(SR_Type,SR_Area,Description,stringsAsFactors=FALSE)
sqlDrop(ch,"GOOD_DATA",errors=FALSE)
sqlSave(ch,GOOD_DATA,rownames="id")
odbcClose(ch)


So basically, but we're doing is to grab all the information inside the brackets, then pass it to vectors to finally create a Data.Frame and send it back to HANA. If you wonder why I'm comparing the length of the different vector to add NA values, it's very simple...we can have something like this...

Type: [This is a test] Area: [My Area] Description: [This data is not right |

The last bracket...it's not a bracket! It's a pipe, so the RegEx is going to fail and this will provoke the vector to be empty and that would be messy...if that happens with can add an NA and at least have a value on it...

So...when we run our program...a new data called GOOD_DATA is going to be created with all the data clean and sanitized.


Nice, right? -;)

See you in my next blog! -:)

2 comentarios:

Unknown dijo...

Hello Alvaro,

I can saw some posts yours regarding Integration by SAP HANA and R. But, until de moment, I can't do this integration.

I have the HANA and R in the same eclipse. Then, Is it need do the connect by ODC?

If yes, pls, help with this information.

See code below.
library("RODBC")
ch<-odbcConnect("HANA",uid="P075400",pwd="HrCOpPk4")

Questions:
"HANA" -> How know servidor name in Hana?
uid -> This is user of the Hana?
pwd -> This is password of the Hana?

I'm questining it, because I tried same options and I can't conected.

Thank You

Alvaro "Blag" Tejada Galindo dijo...

In my example "HANA_SERVER" is the name of the ODBC connection...you can check here http://blagrants.blogspot.com/2012/02/hana-meets-r.html on how to set it up...

Yes, uid and pwd are the user name and password for your HANA server...

And BTW...P075400 was my username so you need to use your own username...

Greetings,

Blag.
Development Culture.