Question

I need to separate the keys and values in text that looks like this:

Student ID:  0
Department ID =          18432
Name                        XYZ

Subjects:
Computer Architecture
Advanced Network Security 2

In the above example, Student ID, Department ID and Name are the keys, and 0, 18432 and XYZ are the values. The keys are separated from the values by :, = or multiple spaces. I tried a regex such as

    $line =~ /(([\w\(\)]*\s)*)([=:\s?]?)\s*(\S.*)?$/;
    $key   = $2;
    $colon=$3;
    $value = $4;

The problem I am facing is telling apart words separated by a single space from words separated by more than one.

For the line Student ID: 0, the output I get is key Student and value ID: 0, whereas I want key Student ID and value 0. For lines like Subjects: and Computer Architecture, the keys should be Subjects and Computer Architecture respectively. Later logic handles the case where there is no value or colon: I append the string to the previous key, so the result looks like Subjects=Computer Architecture;Advanced Network Security 2

Update: Thanks to Ikegami for suggesting that I use the look-behind operator, but I still have trouble solving it.

$line=~/^(?: ( [^:=]+ ) (?<!\s\s)\s* [:=]\s*|\s*)(.*)$/x;

By (?<!\s\s)\s* [:=]\s*|\s* I mean: when there are two or more spaces, consume all of them; when there are no two consecutive spaces, look for : or = and consume the surrounding spaces. So if I pass the line below to the expression, shouldn't I get $1=Name and $2=ABC XYZ?

Name         ABC XYZ

Instead, the key is empty and the value is Name ABC XYZ.


Solution

If

Name Eric Brine
Computer Architecture x86

means

key: Name Eric               value: Brine
key: Computer Architecture   value: x86

then you want

# Requires 5.10
if (/
   ^
   (?: (?<key> [^:=]+ (?<!\s) ) \s* [:=] \s* (?<val> .*  )
   |   (?<key> .+     (?<!\s) ) \s+          (?<val> \S+ )
   )
   \s* $
/x) {
   my $key = $+{key};
   my $val = $+{val};
   ...
}

or, with numbered captures instead of named ones (no 5.10 requirement):

if (/
   ^
   (?: ( [^:=]+ (?<!\s) ) \s* [:=] \s* ( .*  )
   |   ( .+     (?<!\s) ) \s+          ( \S+ )
   )
   \s* $
/x) {
   my ($key,$val) = defined($1) ? ($1,$2) : ($3,$4);
   ...
}

If

Name Eric Brine
Computer Architecture x86

means

key: Name       value: Eric Brine
key: Computer   value: Architecture x86

then you want

# Requires 5.10
if (/
   ^
   (?: (?<key> [^:=]+ (?<!\s) ) \s* [:=]
   |   (?<key> \S+ ) \s
   )
   \s*
   (?<val> .* )
/x) {
   my $key = $+{key};
   my $val = $+{val};
   ...
}

or, with numbered captures instead of named ones (no 5.10 requirement):

if (/
   ^
   (?: ( [^:=]+ (?<!\s) ) \s* [:=]
   |   ( \S+ ) \s
   )
   \s*
   ( .* )
/x) {
   my $key = defined($1) ? $1 : $2;
   my $val = $3;
   ...
}

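Note that the spaces and line breaks inside the patterns are only there for readability (they are ignored under /x), so you can remove them all along with the /x flag. For example, the last snippet can be written as: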

if (/^(?:([^:=]+(?<!\s))\s*[:=]|(\S+)\s)\s*(.*)/) {
   my $key = defined($1) ? $1 : $2;
   my $val = $3;
   ...
}
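
Neither reading above treats a run of two or more spaces as a separator, which is the rule the question states. Here is a minimal sketch of that rule. It assumes the separator is :, = or at least two whitespace characters, and that a line without a separator continues the previous key, as the question describes. The names split_line and parse_records are made up for illustration.

```perl
use strict;
use warnings;

# Split one line into (key, value). Assumed rule, taken from the question:
# the separator is ':' or '=', or a run of two or more whitespace characters.
# A single space never separates, so it stays inside the key or the value.
sub split_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim both ends
    # LIMIT 2 keeps any later ':' or '=' inside the value and preserves
    # an empty trailing value (e.g. "Subjects:" gives ("Subjects", "")).
    my ($key, $val) = split /\s*[:=]\s*|\s{2,}/, $line, 2;
    return ($key, $val);        # $val is undef when there is no separator
}

# Build "key=value" strings. A line without a separator is appended to the
# previous key's value, joined with ';'.
sub parse_records {
    my @records;                # [key, value] pairs in input order
    for my $line (@_) {
        next unless $line =~ /\S/;    # skip blank lines
        my ($key, $val) = split_line($line);
        if (defined $val || !@records) {
            push @records, [ $key, defined $val ? $val : '' ];
        }
        else {
            my $prev = $records[-1];
            $prev->[1] = length($prev->[1]) ? "$prev->[1];$key" : $key;
        }
    }
    return map { "$_->[0]=$_->[1]" } @records;
}

print "$_\n" for parse_records(
    'Student ID:  0',
    'Department ID =          18432',
    'Name                        XYZ',
    '',
    'Subjects:',
    'Computer Architecture',
    'Advanced Network Security 2',
);
```

With the sample from the question, the last record comes out as Subjects=Computer Architecture;Advanced Network Security 2. Using split with a limit of 2 avoids having to describe the key and the value in a single pattern; only the separator needs a regex.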

OTHER TIPS

Try specifying the key part as two bits of text with an optional space in between:

$line =~ /([\w\(\)]*\s?[\w\(\)]*)\s*([=:]?)\s*(\S.*)?$/;

That should capture both one-word and two-word keys.
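
To see where the two-word limit matters, here is a small sketch that runs this pattern over three lines from the question. The helper name tip_split is hypothetical; the pattern is copied unchanged from the tip.

```perl
use strict;
use warnings;

# Apply the tip's pattern and return (key, separator, value).
# Returns an empty list if the pattern does not match.
sub tip_split {
    my ($line) = @_;
    return unless $line =~ /([\w\(\)]*\s?[\w\(\)]*)\s*([=:]?)\s*(\S.*)?$/;
    return ($1, $2, $3);
}

for my $line ('Student ID:  0', 'Computer Architecture', 'Advanced Network Security 2') {
    my ($key, $sep, $val) = tip_split($line);
    printf "key=[%s] sep=[%s] val=[%s]\n", $key, $sep, defined $val ? $val : '';
}
```

The first two lines split as intended, but the third gives key Advanced Network and value Security 2, because the key part stops after two words. A line with three or more single-spaced words therefore still needs separate handling.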

Licensed under: CC-BY-SA with attribution