Channel: Parsing multidimensional data in paragraphs - Unix & Linux Stack Exchange

Parsing multidimensional data in paragraphs

I'm trying to parse data from a PDF report and filter out certain interesting elements. Using pdftotext -layout I get data in this format as my starting point:

Record   Info           Interesting  
123      apple          yep         
         orange         nope         
         lemon          yep          
----------------------------------------------- 
456      dragonfruit    yep
         cucumber       nope         
-----------------------------------------------
789      kumquat        nope         
         lychee         yep          
         passionfruit   yep          
         yam            nope         
-----------------------------------------------
987      grapefruit     nope         

My intended output is this - every 'Interesting' fruit and its record number except when the fruit is the first fruit in its record:

Record   Info
123      lemon
789      lychee
789      passionfruit

Currently, inspired by this question, I'm using sed to replace the ------ record delimiters with \n\n and to strip out the record headers. Then awk in paragraph mode can pick out the paragraphs that contain a matching row:

awk -v RS='' '/\n   .....................yep/'

(Figuring out how to write {3}.{21} or similar with one of the awks is definitely a battle for another day :/ )
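For reference, the two preprocessing steps could be sketched roughly as below. The function names to_paragraphs and interesting_paragraphs are made up for illustration, and the sketch assumes that delimiter lines consist only of dashes (plus optional trailing whitespace) and that fruit names contain no spaces. The pattern matches on "indentation, one word, spaces, yep" instead of counting columns, which avoids interval expressions like {3}.{21} that not every awk supports.

```shell
# Hypothetical helper: drop the header line and blank out the dashed
# delimiters so that each record becomes a blank-line-separated paragraph.
to_paragraphs() {
    sed -e '/^Record/d' -e 's/^-\{2,\}[[:space:]]*$//'
}

# Hypothetical helper: keep only paragraphs in which some row after the
# first one (newline followed by indentation) is marked "yep".
interesting_paragraphs() {
    to_paragraphs | awk -v RS='' -v ORS='\n\n' '/\n +[^ ]+ +yep/'
}
```

Something like pdftotext -layout report.pdf - | interesting_paragraphs would then feed the report through both steps, where report.pdf is a placeholder file name.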

This produces the cleaned-up paragraphs like so:

123      apple          yep         
         orange         nope         
         lemon          yep          

789      kumquat        nope         
         lychee         yep          
         passionfruit   yep          
         yam            nope         

From here I could get the desired output by:

  • adding a second record number column, populated from the first record number column or, where that is blank, from the previous row's second record number column
  • deleting rows which have a record number in the first column
  • deleting rows which aren't interesting
  • cutting out the final columns
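The four steps above might collapse into a single awk call over the cleaned-up paragraphs. This is only a sketch: interesting_rows is a hypothetical name, and it assumes every row after the first has exactly two whitespace-separated columns (fruit and flag), so fruit names with spaces would need a different split.

```shell
# Hypothetical helper: reads blank-line-separated paragraphs on stdin and
# prints "record fruit" for every interesting row except the first of a record.
interesting_rows() {
    awk -v RS='' -F '\n' '
        BEGIN { printf "%-8s %s\n", "Record", "Info" }
        {
            split($1, first, " ")          # first[1] is the record number
            for (i = 2; i <= NF; i++) {    # start at 2 to skip the first fruit
                n = split($i, col, " ")    # col[1] = fruit, col[2] = flag
                if (n == 2 && col[2] == "yep")
                    printf "%-8s %s\n", first[1], col[1]
            }
        }'
}
```

With RS set to the empty string each paragraph is one awk record, and the newline field separator makes each row a field, so the record number only has to be read once per paragraph instead of being copied down a second column.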

Am I going broadly in the right direction here, or is there a more straightforward way to parse multidimensional data? Perhaps by grepping an interesting row (has yep and no record number), then grep backwards from there to the next row with a nonblank record number?
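One possibly more direct route, offered as a sketch and not as the definitive answer, is a single line-by-line awk pass over the raw pdftotext -layout output that remembers the most recent record number instead of grepping backwards. The name interesting_from_layout is hypothetical, and the sketch assumes record numbers always start in the first column, continuation rows are indented, and fruit names contain no spaces.

```shell
# Hypothetical helper: reads the raw layout text on stdin, no sed step needed.
interesting_from_layout() {
    awk '
        BEGIN { printf "%-8s %s\n", "Record", "Info" }
        # Skip the header line, dashed delimiters and blank lines.
        /^Record/ || /^-+ *$/ || NF == 0 { next }
        # A row starting in column one opens a record: remember its number
        # and skip it, because the first fruit is never wanted.
        /^[^ ]/ { rec = $1; next }
        # Indented row: field 1 is the fruit, field 2 the flag.
        $2 == "yep" { printf "%-8s %s\n", rec, $1 }
    '
}
```

Carrying the record number forward in a variable plays the role of the "second record number column", so no intermediate paragraph format is needed.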

