I'm trying to parse data from a PDF report and filter out certain interesting elements. Using pdftotext -layout
I get data in this format as my starting point:
Record  Info          Interesting
123     apple         yep
        orange        nope
        lemon         yep
-----------------------------------------------
456     dragonfruit   yep
        cucumber      nope
-----------------------------------------------
789     kumquat       nope
        lychee        yep
        passionfruit  yep
        yam           nope
-----------------------------------------------
987     grapefruit    nope
My intended output is this: every 'Interesting' fruit and its record number, except when the fruit is the first fruit in its record:
Record Info
123 lemon
789 lychee
789 passionfruit
Currently, inspired by this question, I'm replacing the ------ record delimiters with \n\n and stripping out the record headers using sed. Then I can find paragraphs with matching records with awk:
awk -v RS='' '/\n .....................yep/'
(Figuring out how to write {3}.{21} or similar with one of the awks is definitely a battle for another day :/ )
This produces the cleaned-up paragraphs like so:
123     apple         yep
        orange        nope
        lemon         yep
789     kumquat       nope
        lychee        yep
        passionfruit  yep
        yam           nope
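The preprocessing and paragraph match described above could be sketched roughly as follows. The column widths in the sample, the single header on the first line, and the exact sed expressions are assumptions for illustration, not the commands actually used. The awk pattern uses + instead of a fixed run of dots, so it does not depend on interval-expression support.

```shell
# Sample input laid out the way pdftotext -layout might produce it
# (column widths are assumed for illustration).
report='Record  Info          Interesting
123     apple         yep
        orange        nope
        lemon         yep
-----------------------------------------------
456     dragonfruit   yep
        cucumber      nope
-----------------------------------------------
789     kumquat       nope
        lychee        yep
        passionfruit  yep
        yam           nope
-----------------------------------------------
987     grapefruit    nope'

# sed: 1d drops the header line (assumed to be the first line only), and the
# s command empties each dashed delimiter line, leaving a blank line behind.
# awk: RS="" is paragraph mode, so every blank-line-separated block is one
# record; the pattern keeps blocks that contain a continuation row
# (newline followed by spaces) ending in yep.
paragraphs=$(printf '%s\n' "$report" |
  sed -e '1d' -e 's/^--*$//' |
  awk -v RS='' '/\n +[^ ]+ +yep/')

printf '%s\n' "$paragraphs"
```

With the sample above, only the 123 and 789 blocks survive, because they are the only records with an interesting fruit after the first row.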
From here I could get the desired output by:
- adding a second record number column, populated from the first record number column or the previous row's second record number column
- deleting rows which have a record number in the first column
- deleting rows which aren't interesting
- using cut to pull out the final columns
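Taken literally, those four steps might look something like the pipeline below. This is only a sketch under assumptions: the function name carry_and_filter and the "-" placeholder for a blank first column are made up for illustration, and the field counting assumes that no fruit name contains a space.

```shell
# Cleaned-up paragraphs from the previous step (column widths are assumed).
paragraphs='123     apple         yep
        orange        nope
        lemon         yep
789     kumquat       nope
        lychee        yep
        passionfruit  yep
        yam           nope'

# Hypothetical helper that follows the four steps one by one.
carry_and_filter() {
  # Step 1: add a carried-down record number as a second column.
  # Rows with 3 fields have their own record number; rows with 2 fields
  # do not, so "-" stands in for the blank first column.
  awk 'NF == 3 { rec = $1; print $1, rec, $2, $3; next }
       NF == 2 { print "-", rec, $1, $2 }' |
  # Step 2: drop rows that had a record number in the first column.
  awk '$1 == "-"' |
  # Step 3: drop rows that are not interesting.
  awk '$4 == "yep"' |
  # Step 4: keep the carried-down record number and the fruit.
  cut -d ' ' -f 2,3
}

result=$(printf '%s\n' "$paragraphs" | carry_and_filter)
printf '%s\n' "$result"
```

Each stage is deliberately kept separate to mirror the list; the three awk calls could be folded into one.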
Am I going broadly in the right direction here, or is there a more straightforward way to parse multidimensional data? Perhaps by grepping an interesting row (has yep and no record number), then grepping backwards from there to the next row with a nonblank record number?
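One way to picture the "grep backwards" idea is to read forwards instead and remember the most recent record number. The single awk pass below is a sketch of that, not a tested solution for the real report: it assumes the header is only on the first line, that rows without a record number start with a space (as in the -layout output), and that fruit names contain no spaces. The function name interesting_rows is made up.

```shell
# Raw sample in the assumed pdftotext -layout shape.
report='Record  Info          Interesting
123     apple         yep
        orange        nope
        lemon         yep
-----------------------------------------------
456     dragonfruit   yep
        cucumber      nope
-----------------------------------------------
789     kumquat       nope
        lychee        yep
        passionfruit  yep
        yam           nope
-----------------------------------------------
987     grapefruit    nope'

# Hypothetical helper: one pass, no sed preprocessing and no paragraph mode.
interesting_rows() {
  awk '
    BEGIN { print "Record Info" }     # header for the output
    NR == 1 || /^-+$/ { next }        # skip the input header and the delimiters
    /^[^ ]/ { rec = $1; next }        # row with a record number: remember it, skip the first fruit
    $NF == "yep" { print rec, $1 }    # row without one: report it if interesting
  '
}

result=$(printf '%s\n' "$report" | interesting_rows)
printf '%s\n' "$result"
```

In principle the same function could be fed directly from pdftotext, for example `pdftotext -layout report.pdf - | interesting_rows`, where report.pdf is a placeholder file name.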