metainspector

Created: 2008-06-25 21:49
Updated: 2019-02-27 14:08
License: mit

README.md

MetaInspector Build Status Code Climate

MetaInspector is a gem for web scraping purposes.

You give it an URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...

See it in action!

You can try MetaInspector live at this little demo: https://metainspectordemo.herokuapp.com

Installation

Install the gem from RubyGems:

gem install metainspector

If you're using it on a Rails application, just add it to your Gemfile and run bundle install

gem 'metainspector'

Usage

Initialize a MetaInspector instance for an URL, like this:

page = MetaInspector.new('http://sitevalidator.com')

If you don't include the scheme on the URL, http:// will be used by default:

page = MetaInspector.new('sitevalidator.com')

You can also include the html which will be used as the document to scrape:

page = MetaInspector.new("http://sitevalidator.com",
                         :document => "<html>...</html>")

Accessing response

You can check the status and headers from the response like this:

page.response.status  # 200
page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8",
                      #   "cache-control"=>"must-revalidate, private, max-age=0", ... }

Accessing scraped data

URL

page.url                 # URL of the page
page.tracked?            # returns true if the url contains known tracking parameters
page.untracked_url       # returns the url with the known tracking parameters removed
page.untrack!            # removes the known tracking parameters from the url
page.scheme              # Scheme of the page (http, https)
page.host                # Hostname of the page (like, sitevalidator.com, without the scheme)
page.root_url            # Root url (scheme + host, like http://sitevalidator.com/)

Head links

page.head_links          # an array of hashes of all head/links
page.stylesheets         # an array of hashes of all head/links where rel='stylesheet'
page.canonicals          # an array of hashes of all head/links where rel='canonical'
page.feed                # Get rss or atom links in meta data fields as array

Texts

page.title               # title of the page from the head section, as string
page.best_title          # best title of the page, from a selection of candidates
page.author              # author of the page from the meta author tag
page.best_author         # best author of the page, from a selection of candidates
page.description         # returns the meta description
page.best_description    # returns the first non-empty description between the following candidates: standard meta description, og:description, twitter:description, the first long paragraph

Links

page.links.raw           # every link found, unprocessed
page.links.all           # every link found on the page as an absolute URL
page.links.http          # every HTTP link found
page.links.non_http      # every non-HTTP link found
page.links.internal      # every internal link found on the page as an absolute URL
page.links.external      # every external link found on the page as an absolute URL

Images

page.images              # enumerable collection, with every img found on the page as an absolute URL
page.images.with_size    # a sorted array (by descending area) of [image_url, width, height]
page.images.best         # Most relevant image, if defined with the og:image or twitter:image metatags. Fallback to the first page.images array element
page.images.favicon      # absolute URL to the favicon

Meta tags

When it comes to meta tags, you have several options:

page.meta_tags  # Gives you all the meta tags by type:
                # (meta name, meta http-equiv, meta property and meta charset)
                # As meta tags can be repeated (in the case of 'og:image', for example),
                # the values returned will be arrays
                #
                # For example:
                #
                # {
                    'name' => {
                                'keywords'       => ['one, two, three'],
                                'description'    => ['the description'],
                                'author'         => ['Joe Sample'],
                                'robots'         => ['index,follow'],
                                'revisit'        => ['15 days'],
                                'dc.date.issued' => ['2011-09-15']
                              },

                    'http-equiv' => {
                                        'content-type'        => ['text/html; charset=UTF-8'],
                                        'content-style-type'  => ['text/css']
                                    },

                    'property' => {
                                    'og:title'        => ['An OG title'],
                                    'og:type'         => ['website'],
                                    'og:url'          => ['http://example.com/meta-tags'],
                                    'og:image'        => ['http://example.com/rock.jpg',
                                                          'http://example.com/rock2.jpg',
                                                          'http://example.com/rock3.jpg'],
                                    'og:image:width'  => ['300'],
                                    'og:image:height' => ['300', '1000']
                                   },

                    'charset' => ['UTF-8']
                  }

As this method returns a hash, you can also take only the key that you need, like in:

page.meta_tags['property']  # Returns:
                            # {
                            #   'og:title'        => ['An OG title'],
                            #   'og:type'         => ['website'],
                            #   'og:url'          => ['http://example.com/meta-tags'],
                            #   'og:image'        => ['http://example.com/rock.jpg',
                            #                         'http://example.com/rock2.jpg',
                            #                         'http://example.com/rock3.jpg'],
                            #   'og:image:width'  => ['300'],
                            #   'og:image:height' => ['300', '1000']
                            # }

In most cases you will only be interested in the first occurrence of a meta tag, so you can use the singular form of that method:

page.meta_tag['name']   # Returns:
                        # {
                        #   'keywords'       => 'one, two, three',
                        #   'description'    => 'the description',
                        #   'author'         => 'Joe Sample',
                        #   'robots'         => 'index,follow',
                        #   'revisit'        => '15 days',
                        #   'dc.date.issued' => '2011-09-15'
                        # }

Or, as this is also a hash:

page.meta_tag['name']['keywords']    # Returns 'one, two, three'

And finally, you can use the shorter meta method that will merge the different keys so you have a simpler hash:

page.meta   # Returns:
            #
            # {
            #   'keywords'            => 'one, two, three',
            #   'description'         => 'the description',
            #   'author'              => 'Joe Sample',
            #   'robots'              => 'index,follow',
            #   'revisit'             => '15 days',
            #   'dc.date.issued'      => '2011-09-15',
            #   'content-type'        => 'text/html; charset=UTF-8',
            #   'content-style-type'  => 'text/css',
            #   'og:title'            => 'An OG title',
            #   'og:type'             => 'website',
            #   'og:url'              => 'http://example.com/meta-tags',
            #   'og:image'            => 'http://example.com/rock.jpg',
            #   'og:image:width'      => '300',
            #   'og:image:height'     => '300',
            #   'charset'             => 'UTF-8'
            # }

This way, you can get most meta tags just like that:

page.meta['author']     # Returns "Joe Sample"

Please be aware that all keys are converted to downcase, so it's 'dc.date.issued' and not 'DC.date.issued'.

Misc

page.charset             # UTF-8
page.content_type        # content-type returned by the server when the url was requested

Other representations

You can also access most of the scraped data as a hash:

page.to_hash    # { "url"   => "http://sitevalidator.com",
                    "title" => "MarkupValidator :: site-wide markup validation tool", ... }

The original document is accessi