Cleaning data in SPSS with name misspellings


I have a dataset of 5M records in this basic format:

fname  lname   uniqueid  dob
john   smith   987678    10/08/1976
john   smith   987678    10/08/1976
mary   martin  567834    2/08/1980
john   smit    987678    10/08/1976
mary   martin  768987    2/08/1980

Taking the dob into account, there are cases with the same id but different name spellings, and cases with different ids but the same name.

I got as far as making SPSS recognize that john smit and john smith with the same dob are the same person, and I used aggregate to show how many times each spelling was used next to the name (john smith, 10; john smit, 5).

Case 1: I would like to loop through the records for people identified as the same person, find the most common spelling of the person's name, and use that as the standard name.

Case 2: If there are multiple ids for the same person, take the lowest one and make it the standard.

I am comfortable using basic syntax to clean data, but this is the one thing I'm stuck on.

If uniqueid is a real unique id of individuals in the population, and you want to find variations of name spellings (within groupings of these ids) and assign the modal occurrence, then something like this would work:

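* Build one name string per record (rtrim drops the padding on fname).
string firstlastname (a99).
compute firstlastname = concat(rtrim(fname), " ", lname).
* Count how often each spelling occurs within each id.
aggregate outfile=* mode=addvariables
  /break=uniqueid firstlastname
  /count=n.
* Find the highest spelling count within each id.
aggregate outfile=* mode=addvariables
  /break=uniqueid
  /maxcount=max(count).
* Blank out every spelling that is not the modal one.
if (count <> maxcount) firstlastname = "".
* Spread the modal spelling to all records with the same id.
aggregate outfile=* mode=addvariables overwrite=yes
  /break=uniqueid
  /firstlastname=max(firstlastname).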

You could then also overwrite the fname and lname fields, but more assumptions would have to be made, for example about whether fname or lname can contain space characters, etc.
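
As a rough sketch of that overwrite, and of case 2 from the question, something along these lines might work. It assumes firstlastname already holds the standardized spelling, that first names never contain a space (so the first space separates fname from lname), and that records sharing the same name and dob are the same person. The variable name stdid is hypothetical, and none of this has been run against the real data.

```spss
* Sketch only: assumes firstlastname already holds the modal spelling.
* Also assumes first names contain no spaces.
compute #pos = char.index(firstlastname, " ").
do if (#pos > 1).
  compute fname = char.substr(firstlastname, 1, #pos - 1).
  compute lname = char.substr(firstlastname, #pos + 1).
end if.
* Case 2: lowest id among records sharing the same name and dob.
* stdid is a hypothetical variable name.
aggregate outfile=* mode=addvariables
  /break=firstlastname dob
  /stdid=min(uniqueid).
```

With the sample rows above, both mary martin records would get stdid 567834. If ids are merged this way, it may be worth rerunning the modal-spelling step with stdid as the break variable, so that spellings are compared across the merged ids as well.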

