質問

For quite some time I have been trying to format space separated data to a CSV structure.

Initial position

The initial data table is given by:

Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE    Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment   
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic    Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment   
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center     Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment

It contains lots of spaces and unnecessary information throughout. The information is present somewhat like this

Doctor's name | Degree | Years of experience | Specialization | Hospital name | Address | Fees | Schedule | and an unnecessary book appointment field.

I want to convert it to the following format

Doctor's name,Specialization,Hospital name,Address,Fees,Schedule

So the current data should look like this

 Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
 Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM   
 Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM

Till now I have succeeded in removing the Book Appointment field.

Problem

However I am facing difficulties in classifying the hospital's name. As the spacing in it varies a lot. Is this problem feasible?

EDIT

The output of cat -A file is the following:

 Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE ^I Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment $
 Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic ^I Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment $
 Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center ^I Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
役に立ちましたか?

解決

There's no straightforward way to separate the specialization from the hospital name, but with some assumptions, you could perhaps use perl to do this:

perl -pe 's/^(\S+\s+\S+\s+\S+).+experience\s([^\t]+?)\s+(\b[A-Z0-9]{2}[^\t]+?|(?:(?!\b[A-Z0-9]{2})[^\t])*)\s+\t\s+([^,]+,).+?(INR.+?PM)\s+.*/\1,\2,\3,\4\5/' file

Gives:

Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250 MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200 MON-SUN10:00AM-8:00PM

And since it's perl based regex, you can use regex101 to get a glimpse of how it works through the regex debugger. The regex is quite straightforward, but the fact that there are many parts can make it appear daunting.

Warning: The above is able to separate the specialization based on two things:

  1. It tries to find the first occurrence of space followed by two uppercase characters or digits and starts matching as the hospital name when it finds it; or
  2. If there are no consecutive uppercase characters or digits, it takes only the first word as the specialization and the rest as the hospital name.

I know it might not solve the complete problems as there are always lines that won't fit the above rules, but that can get you started on cleaning these up. If there is anything incorrectly separated (i.e. when the specialization consists of more than 1 word and the hospital name doesn't have two consecutive upper/digit) you will have one word of the specialization correctly placed, and the rest in the hospital name.

他のヒント

Unfortunately, based on your input, there's no way to separate specialisation with hospital name. The other fields can be captured, albeit inelegantly and with gawk (probably >= 4.0, but I think 3.x should work):

$ awk -F" \t " -v OFS="," -v S=" " '
{
    sub(/\s+$/, "");
    split($2, Data, /[ ,]{2,}/);
    Address  = Data[1];
    split($2, Data, / +/);
    nData    = length(Data);
    Schedule = Data[nData - 2];
    Fees     = Data[nData - 4] S Data[nData - 3];
    split($1, Data, / +/);
    Name     = Data[1] S Data[2] S Data[3]; # assume all names are Dr. Xxx Xxx only
    match($1, /[0-9]+ years experience /);
    SpecializationHospital = substr($1, RSTART + RLENGTH);
    print Name, SpecializationHospital, Address, Fees, Schedule;
} ' data.txt
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM
ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top