Cleaning data in SPSS with name misspellings
I have a dataset of 5 million records in this basic format:
fname  lname   uniqueid  dob
john   smith   987678    10/08/1976
john   smith   987678    10/08/1976
mary   martin  567834    2/08/1980
john   smit    987678    10/08/1976
mary   martin  768987    2/08/1980
The dob is unique, and I have cases where there is the same id with different name spellings, or a different id with the same name.
I got as far as making SPSS recognize that john smit and john smith with the same dob are the same people, and I used aggregate to show how many times each spelling was used next to the name (john smith, 10; john smit, 5).
Case 1: loop through the records of people identified as the same person, find the most common spelling of the person's name, and use that as the standard name.
Case 2: if I have multiple ids for the same person, take the lowest one and make that the standard.
I am comfortable using basic syntax to clean data, but this is the thing I'm stuck on.
If uniqueid is the real unique id of individuals in the population, and you want to find variations of name spellings (within groupings of these ids) and assign the modal occurrence, then this would work:
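* Build one combined name string per record.
string firstlastname (a99).
compute firstlastname = concat(rtrim(fname), " ", rtrim(lname)).
* Count how often each spelling occurs within each id.
aggregate outfile=* mode=addvariables
  /break=uniqueid firstlastname
  /count=n.
* Find the highest spelling count within each id.
aggregate outfile=* mode=addvariables
  /break=uniqueid
  /maxcount=max(count).
* Blank out every spelling that is not the modal one.
if (count<>maxcount) firstlastname = "".
* Spread the modal spelling to all records of the id (ties go to the alphabetically last spelling).
aggregate outfile=* mode=addvariables overwrite=yes
  /break=uniqueid
  /firstlastname=max(firstlastname).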
You could then overwrite the fname and lname fields, but more assumptions would have to be made, for example if fname or lname can contain space characters, etc.
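The two remaining steps could be sketched roughly as follows. This is only a sketch under stated assumptions, not tested against the real data: it assumes firstlastname already holds the standardized name from the syntax above, that fname never contains a space (so the first space separates the two names), and that uniqueid sorts in the order you consider "lowest". The variable stdid and the scratch variable #pos are hypothetical names introduced here for illustration.

```spss
* Split the standardized name back into fname and lname.
* Assumption: the first space in firstlastname separates the two names.
compute #pos = char.index(firstlastname, " ").
do if (#pos > 1).
  compute fname = char.substr(firstlastname, 1, #pos - 1).
  compute lname = char.substr(firstlastname, #pos + 1).
end if.

* Case 2: records sharing a standardized name and dob get the lowest id.
* stdid is a new variable, so the original uniqueid is left untouched.
aggregate outfile=* mode=addvariables
  /break=firstlastname dob
  /stdid=min(uniqueid).
execute.
```

Writing the result to a new variable rather than overwriting uniqueid keeps the original ids available, so the merged groups can be checked before anything is replaced.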