Forum Discussion
rminnich
15 years ago · Contributor
You could parse it using something besides a regex. You can step through the string manually, extracting what you need and dropping the rest. For example, you know that the <html> and </html> tags can be discarded. Everything between <style> and </style> can be discarded. The <br/> tag can be replaced with a newline.
It gets slightly trickier for <a> tags, because you only care about the element's text content and not its attributes.
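Here is a rough sketch of that manual stepping in Python, in case it helps. The function names (strip_markup, _end_of_style) and the sample input are made up for illustration, and I'm assuming the markup is simple: no ">" inside attribute values, no comments or <script> blocks, and no character entities that need decoding.

```python
def _end_of_style(source: str, start: int) -> int:
    """Return the index just past the next </style> (case-insensitive), or -1."""
    closing = "</style>"
    j = source.find("<", start)
    while j != -1:
        if source[j:j + len(closing)].lower() == closing:
            return j + len(closing)
        j = source.find("<", j + 1)
    return -1


def strip_markup(source: str) -> str:
    """Walk the string once, keeping text and dropping tags."""
    out = []
    i = 0
    while i < len(source):
        if source[i] != "<":
            out.append(source[i])      # ordinary text: keep it
            i += 1
            continue
        end = source.find(">", i)
        if end == -1:                  # unterminated tag: treat the rest as text
            out.append(source[i:])
            break
        # Tag name only; attributes (e.g. href on <a>) are ignored.
        parts = source[i + 1:end].rstrip("/").split()
        name = parts[0].lower() if parts else ""
        if name == "style":
            after = _end_of_style(source, end + 1)
            if after == -1:            # no closing tag: drop the remainder
                break
            i = after                  # skip the whole stylesheet
            continue
        if name == "br":
            out.append("\n")           # <br/> becomes a newline
        i = end + 1                    # every other tag is simply dropped
    return "".join(out)


sample = ('<html><style>p {color: red}</style>Hello<br/>'
          'see <a href="http://example.com">this link</a></html>')
print(strip_markup(sample))
```

Because the <a> tag is dropped like any other tag and only the text between tags is kept, the link text survives while the href goes away.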
I could also see treating this HTML as XML and using a SAX parser to grab the text needed. That would be a more complex solution, and it only works if the markup is well-formed XML.
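For comparison, a minimal sketch of the SAX route using Python's standard xml.sax module. The class and function names (TextCollector, text_from_xhtml) and the sample string are hypothetical, and this assumes the input really is well-formed XML; typical sloppy HTML (unclosed tags, bare "&") will make the parser raise an error.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect character data, skipping <style> and mapping <br/> to a newline."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.style_depth = 0           # > 0 while inside a <style> element

    def startElement(self, name, attrs):
        tag = name.lower()
        if tag == "style":
            self.style_depth += 1
        elif tag == "br":
            self.parts.append("\n")
        # attrs (e.g. href on <a>) are deliberately ignored

    def endElement(self, name):
        if name.lower() == "style":
            self.style_depth -= 1

    def characters(self, content):
        if self.style_depth == 0:
            self.parts.append(content)


def text_from_xhtml(markup: str) -> str:
    handler = TextCollector()
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return "".join(handler.parts)


sample_xhtml = ('<html><style>p {color: red}</style>Hello<br/>'
                'see <a href="http://example.com">this link</a></html>')
print(text_from_xhtml(sample_xhtml))
```

The parser handles tokenizing and entity decoding for you, but you pay for it with the handler boilerplate and the well-formedness requirement.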