Forum Discussion
rminnich
14 years ago
Contributor
You could parse it with something other than a regex: step through the string manually, extracting what you need and discarding the rest. For example, you know that the <html> and </html> tags can be discarded, and so can everything between <style> and </style>. The <br/> tag can be replaced with a newline.
It gets slightly trickier for <a> tags, because you only care about the element's text and not its attributes.
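Here is a rough sketch of that manual walk in Python (the thread doesn't say which language you're using, so treat the language as an assumption). The function name `strip_html` is made up for illustration. It drops every tag, skips whatever sits between <style> and </style>, and turns <br/> into a newline. Because the whole tag is dropped, the attributes on <a> go with it and only the link text stays.

```python
def strip_html(source: str) -> str:
    """Walk the string once, keeping text and dropping markup (hypothetical helper)."""
    out = []
    i = 0
    n = len(source)
    skipping_style = False  # True while we are between <style> and </style>

    while i < n:
        if source[i] == "<":
            end = source.find(">", i)
            if end == -1:
                # Unterminated tag: keep the remainder as plain text and stop.
                if not skipping_style:
                    out.append(source[i:])
                break

            inner = source[i + 1:end].strip().lower()
            closing = inner.startswith("/")
            parts = inner.strip("/").split()
            name = parts[0] if parts else ""  # tag name only; attributes are ignored

            if name == "style":
                skipping_style = not closing
            elif name == "br" and not skipping_style:
                out.append("\n")
            # Every other tag (<html>, <a ...>, </a>, ...) is simply dropped.
            i = end + 1
        else:
            next_tag = source.find("<", i)
            if next_tag == -1:
                next_tag = n
            if not skipping_style:
                out.append(source[i:next_tag])
            i = next_tag

    return "".join(out)


sample = '<html><style>p {color: red}</style>Hello<br/><a href="http://example.com">link</a> end</html>'
print(strip_html(sample))
```

For the sample above this prints "Hello", then "link end" on the next line. It is deliberately naive: a ">" inside an attribute value, a "<" inside the style block, comments, and character entities such as &amp; are not handled.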
I could also see treating this HTML as XML and using a SAX parser to grab the text you need, though that would be a more complex solution.
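A minimal sketch of the SAX route, again assuming Python and using the standard-library `xml.sax` module. The names `TextExtractor` and `extract_text` are invented for this example. It only works if the HTML happens to be well-formed XML (every tag closed, attributes quoted, no HTML-only entities like &nbsp;); otherwise the parser raises an error.

```python
import xml.sax


class TextExtractor(xml.sax.ContentHandler):
    """Collects character data, skipping <style> content (hypothetical handler)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.style_depth = 0  # greater than 0 while inside a <style> element

    def startElement(self, name, attrs):
        tag = name.lower()
        if tag == "style":
            self.style_depth += 1
        elif tag == "br" and self.style_depth == 0:
            self.parts.append("\n")
        # attrs is ignored on purpose: only element text is wanted.

    def endElement(self, name):
        if name.lower() == "style":
            self.style_depth -= 1

    def characters(self, content):
        # May be called several times for one run of text, so just accumulate.
        if self.style_depth == 0:
            self.parts.append(content)


def extract_text(xhtml: str) -> str:
    handler = TextExtractor()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return "".join(handler.parts)


sample = '<html><style>p {color: red}</style>Hello<br/><a href="http://example.com">link</a> end</html>'
print(extract_text(sample))
```

On the same sample this gives the same result as the manual walk. The parser takes care of tag boundaries and attribute quoting for you, but you pay for that by needing strictly well-formed input.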