Monthly Archives: December 2008

Internet Screen Scraping

Abstract

Screen scraping is the act of programmatically reading data displayed on a screen, evaluating it, and extracting the required information into a usable format. The technique is widely used by search engines to collect information from web sites, by integration tools to transfer data between systems, by business intelligence applications to support strategic decisions, by meta-search services, and so on. Its applications are numerous, but careless use can lead to legal issues. This white paper is a case study of an Internet screen scraping solution developed for automating regular web tasks and extracting information from web-based systems.

Internet Screen Scraping

Screen scraping is the act of programmatically reading data displayed on a screen, evaluating it, and extracting the required information into a usable format.

Screen scraping has many applications: sometimes it is done simply to display information in a richer fashion, and sometimes to send information to another system. A common example is the web search engine, which uses spiders that scrape pages to extract and record keywords from HTML.

Traditionally, screen scraping refers to programming that translates between legacy application programs (written to communicate with now generally obsolete input/output devices and user interfaces) and new user interfaces, so that the logic and data associated with the legacy programs can continue to be used.

In this white paper, we discuss only Internet screen scraping. Internet screen scraping, or web scraping, means programming that translates data between a source web application and a destination application. The source application, that is, the application to be scraped, must be ‘scrapable’. By ‘scrapable’ we mean a web-based application with an HTML front end that can be parsed.

For a scrapable application, most of the information of interest is textual and can be parsed. Web applications developed using Developer 2000, Java applets and the like are out of scope, because their content cannot be parsed; such applications require conventional screen scraping techniques (terminal emulation, character reading, etc.). Scrapable web applications are typically built with technologies such as Java, JSP, ASP and .NET. The only requirement is that the output pages generated by the web server contain textual content.
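
As a rough illustration of what ‘scrapable’ means in practice, the sketch below strips the markup from a page and checks whether any text is left. It is a minimal example under stated assumptions, not the solution described in this paper: the class and method names are hypothetical, and tags are removed with regular expressions, where a production scraper would normally use a proper HTML parser.

```java
import java.util.regex.Pattern;

final class ScrapableCheck {
    // Matches <script>...</script> and <style>...</style> blocks, whose bodies are not visible text.
    private static final Pattern NON_TEXT_BLOCKS =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Matches any remaining tag such as <p> or </td>.
    private static final Pattern TAGS = Pattern.compile("(?s)<[^>]+>");

    /** Returns the text of an HTML page with markup removed and whitespace collapsed. */
    static String visibleText(String html) {
        String withoutBlocks = NON_TEXT_BLOCKS.matcher(html).replaceAll(" ");
        String withoutTags = TAGS.matcher(withoutBlocks).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    /** Assumption: a page counts as scrapable here if any text survives tag stripping. */
    static boolean isScrapable(String html) {
        return !visibleText(html).isEmpty();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Orders</h1><p>Total: 42</p></body></html>";
        System.out.println(visibleText(page));
        // An applet-only page carries no parsable text.
        System.out.println(isScrapable("<applet code=\"App.class\"></applet>"));
    }
}
```

A page that renders its content inside an applet yields no text at all under this check, which is why such applications fall back to the conventional techniques mentioned above.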

Why Web Scraping?

Internet Screen Scraping can be used in the following areas:

  • Application Integration 
  • Data Mining and Extraction 
  • Data Migration 
  • Business Intelligence 
  • Web Task Automation 
  • Portal Components 
  • Meta-Searching 
  • Archiving

Application Integration

Organizations often run multiple web-based applications for their business processes. Communication between these systems is frequently manual and therefore error-prone. Screen scraping can be used to build bridging software that integrates different applications, presents several applications as one, or transfers data from one system to another in real time.

  • Transferring data between different systems in real time 
  • Bridging external applications to internal systems 
  • Building a single interface from many independent systems

Data Mining and Extraction

Extracting data from secure web sites requires authenticating, browsing to a particular web page, entering certain parameters, and so on. The resulting web page containing the relevant information then needs to be validated and converted into a different format, inserted into a database, or passed to another application. A few applications of data mining and extraction are given below:

  • Pulling product data from multiple web sites to create meta-search sites
  • Extracting news headlines to use in RSS feeds 
  • Archiving web pages or resources
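
The headline-to-RSS case above could look roughly like the following sketch. The page layout is an assumption made for illustration (each headline is taken to be a link inside an h2 element), the class and method names are hypothetical, and the headline text is assumed to contain no nested tags or HTML entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeadlineFeed {
    // Assumption: headlines appear as <h2><a href="...">title</a></h2> in the source page.
    private static final Pattern HEADLINE = Pattern.compile(
            "(?is)<h2[^>]*>\\s*<a\\s+href=\"([^\"]+)\"[^>]*>(.*?)</a>\\s*</h2>");

    /** Extracts each headline and renders it as an RSS item element. */
    static List<String> toRssItems(String html) {
        List<String> items = new ArrayList<>();
        Matcher matcher = HEADLINE.matcher(html);
        while (matcher.find()) {
            String link = matcher.group(1).trim();
            String title = matcher.group(2).replaceAll("\\s+", " ").trim();
            items.add("<item><title>" + escapeXml(title) + "</title><link>"
                    + escapeXml(link) + "</link></item>");
        }
        return items;
    }

    /** Escapes the characters that must not appear raw in XML text. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String page = "<h2><a href=\"/news/1\">Rates cut</a></h2>"
                + "<p>Body text</p>"
                + "<h2 class=\"t\"><a href=\"/news/2\"> Oil  up </a></h2>";
        for (String item : toRssItems(page)) {
            System.out.println(item);
        }
    }
}
```

The extracted items would still need to be wrapped in a channel element and validated before being published as a feed.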

Data Migration

Most legacy systems provide read-only access to the data behind them. Migrating this data to a more flexible system involves a large amount of costly manual effort and can be error-prone. Migrating the data from legacy systems without manual intervention minimizes these risks.

  • Migrating data from legacy applications 
  • Migrating data from one application to another when upgrading 

Business Intelligence 

To succeed, a company has to understand its market and competitors, but information about competitors’ products and services is valuable data that they are unlikely to share, except through a publicly accessible web site.

  • Comparing product listings and prices of a competitor
  • Extracting news relevant to your marketplace on a routine basis
  • Monitoring new product and service launches

Web Task Automation

Web tasks such as entering data in a form, testing a web server’s output, and monitoring business transactions require manual effort and web navigation. The data to be entered may come from a database, another application, or other web sites. These tasks are tedious and error-prone, and they can be automated using screen scraping.

  • Monitoring of web portals and complex transactions, including transactions that require input from other databases or web applications
  • Web site testing
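
Automated data entry usually comes down to building the same request a browser would send when the form is submitted. The sketch below shows only that step: it encodes field values, which could be read from a database, as a form-encoded request body. The class name and field names are hypothetical, and sending the request, handling sessions and checking the response are left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormFiller {
    /** Encodes name/value pairs as an application/x-www-form-urlencoded request body. */
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(field.getKey())).append('=').append(encode(field.getValue()));
        }
        return body.toString();
    }

    /** URL-encodes one token as UTF-8, hiding the checked exception from callers. */
    private static String encode(String token) {
        try {
            return URLEncoder.encode(token, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical form fields; in practice these would come from a database row.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("customer", "John Smith");
        fields.put("note", "a&b=c");
        System.out.println(encodeForm(fields));
    }
}
```

A LinkedHashMap is used so that the fields are sent in the order they were added, which some server-side forms depend on.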

Portal Components 

Developing portal components can be a challenge when the only thing the applications have in common is a web interface. Interfacing different applications requires knowledge of the application APIs and, even when these are known, often means creating complex middleware. Building the portal components should not affect the parent applications.

  • Common look and feel across multiple applications 
  • Creating portlets (embedding services like weather forecast)
  • Combining applications into a single interface

Meta-Searching 

Meta-searching is the ability to search multiple web sites and evaluate the results. Many sites do not provide a consistent interface, so the data from each site needs to be extracted and stored in a uniform format to keep the search results consistent.

  • Creating meta-search web portals for price matching
  • Conducting product research
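
The uniform format mentioned above can be sketched as a single record type that every site-specific scraper fills in. The example below is illustrative only: the Offer type, the site names and the price formats are assumptions, and the price parser assumes a dot as the decimal separator and a comma as the thousands separator.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MetaSearch {
    /** Uniform result record shared by every source site. */
    static final class Offer {
        final String site;
        final String product;
        final BigDecimal price;

        Offer(String site, String product, BigDecimal price) {
            this.site = site;
            this.product = product;
            this.price = price;
        }
    }

    /** Parses price strings such as "$1,299.00" or "1249.50 USD" into a number. */
    static BigDecimal parsePrice(String raw) {
        // Keep only digits and the decimal point; currency symbols and commas are dropped.
        return new BigDecimal(raw.replaceAll("[^0-9.]", ""));
    }

    /** Merges the offers scraped from each site and sorts them by ascending price. */
    static List<Offer> merge(List<List<Offer>> offersPerSite) {
        List<Offer> all = new ArrayList<>();
        for (List<Offer> offers : offersPerSite) {
            all.addAll(offers);
        }
        all.sort(Comparator.comparing((Offer offer) -> offer.price));
        return all;
    }

    public static void main(String[] args) {
        List<Offer> siteA = Arrays.asList(new Offer("siteA", "Laptop", parsePrice("$1,299.00")));
        List<Offer> siteB = Arrays.asList(new Offer("siteB", "Laptop", parsePrice("1249.50 USD")));
        for (Offer offer : merge(Arrays.asList(siteA, siteB))) {
            System.out.println(offer.site + " " + offer.product + " " + offer.price);
        }
    }
}
```

Once every site is reduced to the same record, price matching is an ordinary sort over the merged list.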

Archiving

Because web sites are dynamic, their contents change frequently. Archiving therefore requires snapshots of web pages, and each snapshot should capture the page as it was at that time, including multimedia content.

  • Archiving receipts of web transactions 
  • Building static web sites from dynamic content 
  • Archiving snapshots of web content for analysing historic data

A Case Study

Screen scraping has been implemented successfully and is widely used in day-to-day activities in our project. The architecture of the application is illustrated below. The solution was designed and developed in-house within the project.

[Image: scrap]
 
Posted by on December 20, 2008 in Web

 

Added Portal Support in SmartGWT Showcase

Portal Post

A couple of users have asked about portal support. Writing portals in SmartGWT / SmartClient is fairly simple, as the Layouts have animation and dynamic add / remove built in. Here’s an example of a SmartClient Portal:

http://www.smartclient.com/smartgwt/…animation.html

APIs for this will be added in the next release, and likely in a nightly build over the next few weeks.

Sanjiv

Screenshot:

[Image: portlet]
 
Posted by on December 3, 2008 in GWT

 

Stripes Framework

Introduction

Stripes is an open source web application framework based on the MVC pattern. It aims to be very lightweight, more so than Struts, by using the annotations and generics introduced in Java 1.5. Existing frameworks such as Struts, Spring MVC and WebWork 2 require gobs of configuration. Although all of these frameworks are widely used, they cost a lot of time in architectural and configuration issues.

The normal flow of events and components typical of applications written with Stripes:

[Image: image11]

Key Features:

  • Zero external configuration per page/action (ActionBeans are auto-discovered, and configured using annotations)
  • Powerful binding engine that will build complex object webs out of the request parameters
  • Easy to use (and localized) validation and type conversion system
  • Localization system that works even when you use direct JSP->JSP links
  • Ability to re-use ActionBeans as view helpers
  • Ridiculously easy to use indexed property support
  • Built in support for multiple events per form
  • Transparent file upload capabilities
  • Support for incremental development (e.g. you can build and test your JSP before even thinking about your ActionBean)
  • And a lot of built in flexibility that you only have to be aware of when you need to use it

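Building the First Stripes Action:

import net.sourceforge.stripes.action.ActionBean;
import net.sourceforge.stripes.action.ActionBeanContext;
import net.sourceforge.stripes.action.DefaultHandler;
import net.sourceforge.stripes.action.ForwardResolution;
import net.sourceforge.stripes.action.Resolution;
import net.sourceforge.stripes.validation.Validate;
import net.sourceforge.stripes.validation.ValidateNestedProperties;

// Person is assumed to be a simple domain class with firstName and age properties.
public class HelloWorldAction implements ActionBean {

    // Both nested fields are required, and age must be at least 13, for the "hello" event.
    @ValidateNestedProperties({
        @Validate(field = "firstName", required = true, on = {"hello"}),
        @Validate(field = "age", required = true, minvalue = 13, on = {"hello"})
    })
    private Person person;

    private ActionBeanContext context;

    // Default event: forwards to the input page.
    @DefaultHandler
    public Resolution index() {
        return new ForwardResolution("Hello.jsp");
    }

    // "hello" event: forwards to the greeting page.
    public Resolution hello() {
        return new ForwardResolution("SayHello.jsp");
    }

    public void setPerson(Person person) { this.person = person; }

    public Person getPerson() { return person; }

    // Required by the ActionBean interface.
    public void setContext(ActionBeanContext c) { this.context = c; }

    public ActionBeanContext getContext() { return context; }

}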

Understanding the Concept Behind:

The controller class resembles a POJO (Plain Old Java Object) that implements a Stripes-specific interface called ActionBean. All Stripes actions need to implement this interface to allow the StripesDispatcher servlet to inject an ActionBeanContext object into the current action being serviced. The ActionBeanContext object allows you to access servlet API objects such as the request, response, and servlet context. Most of the time it is not necessary to access these low-level API objects in a Stripes application. The ActionBeanContext class also allows you to get state information about the current action as well as add informational messages and error messages from the current action. The ActionBeanContext field and accessors can be stored in a base class since all Stripes actions will require this implementation.

The rest of the controller class should be familiar to any Java developer. There is a Person object with accessors that will be used to read and write our person’s first and last name to our views. While this is a simple nested object, Stripes allows more sophisticated data binding with Java collections, generics support, and indexed properties. Since Stripes can handle complex data binding, your domain objects can be reused in other layers that need them. For example, it is easy to collect information in a domain object via Stripes and make persistent changes with other POJO frameworks like Hibernate or EJB 3.

The view and the controller are both written the Stripes way, which keeps things simple.

Stripes over Struts:

Number of artifacts: With Struts, just to implement a single page/form I have to write or edit many files, and I have to keep them in sync or things start going horribly wrong. I have to write my JSP, my Action, my Form, a form-bean stanza in struts-config.xml, and an action stanza in struts-config.xml. Compare this with Stripes: I write my JSP, then I write my ActionBean and annotate it with @UrlBinding to specify the URL it should respond to and one or more @HandlesEvent annotations to map events to methods. I’m done. All the information about the form and the action is in the same place.

Incremental development: Stripes lets you write your JSP, see if it looks OK, and only then go and write the back-end components to go with it.

Property binding: Struts lets you use nested properties, but it fails when an intermediate property is null; Stripes will instantiate the intermediate objects for you.

Validation: In Stripes validation is tied closely to type conversion. A number of commonly used validations can be applied pre-conversion using a simple annotation. This includes things like required field checks, length checks, regex checking etc.

Multi-Event Actions: If you want a form that submits several different events in Struts, you have to either extend DispatchAction or write your own support for it. Because DispatchAction requires all buttons to have the same name and uses the value to determine the method to invoke, it’s a huge pain if you’re using localized or even just externalized values for your buttons. Stripes uses the name of the button itself and has built-in support for multi-event Actions. You can localize to your heart’s content, and Stripes will detect which button was pressed and invoke the right method for you.
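
The idea of dispatching on the button name rather than its value can be sketched in plain Java. This is not how Stripes is implemented and it does not use the Stripes API: the Handles annotation, the CartAction class and the dispatch method are all hypothetical stand-ins, written only to show why localized button labels stop being a problem.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

final class EventDispatchSketch {
    /** Hypothetical stand-in for an event-mapping annotation; not part of Stripes. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Handles {
        String value();
    }

    /** Toy action with two events; each handler returns the name of the view to show. */
    static final class CartAction {
        @Handles("save")
        public String save() { return "Saved.jsp"; }

        @Handles("delete")
        public String delete() { return "Deleted.jsp"; }
    }

    /**
     * Invokes the handler whose event name appears as a request parameter NAME.
     * The parameter value (the possibly localized button label) is never inspected.
     */
    static String dispatch(Object action, Map<String, String> requestParams) {
        for (Method method : action.getClass().getMethods()) {
            Handles handles = method.getAnnotation(Handles.class);
            if (handles != null && requestParams.containsKey(handles.value())) {
                try {
                    return (String) method.invoke(action);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Handler failed: " + method.getName(), e);
                }
            }
        }
        throw new IllegalArgumentException("No handler matches the request parameters");
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("delete", "Supprimer"); // localized label; only the name "delete" matters
        System.out.println(dispatch(new CartAction(), params));
    }
}
```

Because the lookup key is the parameter name, translating the button label changes nothing in the dispatch logic.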

JSP / View Helpers: Struts doesn’t really provide a good pattern for providing dynamic data to JSPs that are not the result of another Action. Stripes has a neat way of handling this. A custom tag allows the use of ActionBeans as view helpers. It works similarly to the jsp:useBean tag in that if the ActionBean already exists it just gives you a reference to it. If it doesn’t exist, the tag will bring it into existence, bind data out of the request on to it, and get it ready for use.

“HTML” Tags: Why do the Struts form input tags use ‘property’ instead of ‘name’? And why is it ‘styleClass’ instead of ‘class’? This makes it hard to change a tag back and forth between a plain HTML tag and a Struts tag. Stripes takes pains to make all the form input tags as close to (if not identical to) their HTML counterparts as possible.

Goals of Stripes:

  • Make developing web applications in Java easy.
  • Provide simple yet powerful solutions to common problems
  • Make the Stripes ramp up time for a new developer less than 30 minutes
  • Make it really easy to extend Stripes, without making you configure every last thing.

Latest Version:

  • Stripes 1.5 is the latest released version.


Posted by on December 1, 2008 in Web

 


 