Notes from the W3C SWS Testbed Incubator Group

From SWS Challenge Wiki

This page contains informal notes collected from discussions on the members' mailing list of the W3C SWS Testbed Incubator Group.

There are currently four items being discussed in the group:

Item 1
Methodology for evaluating SWS (SWS-M). A number of SWS Challenge workshops have been held; the idea is to distill lessons from this series of workshops for broader use by the SWS community. Discussion Coordinator: Charles Petrie
Item 2
SWS ontology (SWS-O). This was proposed by Dieter Fensel during one of the SAWSDL discussions. Discussion Coordinator: Tomas Vitvar
Item 3
Semantic Annotation of RESTful services. Discussion Coordinator: Karthik Gomadam.
Item 4
Adding semantics to WS-Policy assertions. Discussion Coordinator: Ajith Ranabahu. See http://knoesis.wright.edu/projects/w3cxg/ws-policy-semantic-annotation.html

Methodology for evaluating SWS

Below follows a summary of the SWS-Challenge approach, which is based on a first contribution by Charles Petrie and was later updated to reflect the state of discussion in the W3C SWS Testbed Incubator Group.

Proposal for a SWSC Methodology: Over the last two years and five workshops of the Semantic Web Services Challenge, we, as a community, have discussed and experimented with the best way to evaluate technologies for the mediation, discovery, and composition of web services.

The "Semantic" in the initiative title refers to the hypothesis that if more of the problem and solution semantics are made declarative and machine-readable, then programmer productivity for correct programs can be improved. SWSC workshop participants are expected, but not required, to formalize aspects of the SWSC scenarios in order to increase programmer productivity. Formulations should be shared and we hope that more successful ones will be re-used by other participants, which results in an informal measure of semantic success. Formalizations include annotations to more fully describe the web services in the scenarios as well as the problem domain.

One of the important goals of the SWSC is to develop a common understanding of the various technologies fielded in the workshops. So far, the approaches range from conventional programming techniques with purely implicit semantics, to software engineering techniques for modeling the domain in order to develop applications more easily, to partial use of restricted logics, to full semantic annotation of the web services.

Goals of the SWSC Initiative

The SWSC Initiative pursues three related goals:

  1. We try to promote a scientific understanding of the differences and tradeoffs among the various approaches to solving the SWSC problem scenarios. By comparing different implemented solutions to the same set of problems, we are able to learn about each other's technology.
  2. Based on a growing set of problem scenarios, we assess the functional coverage of different approaches and certify the extent to which approaches actually solved particular problems. This way we provide a certification of the capabilities of particular technologies.
  3. Understanding that all of the approaches are Turing complete, we try to evaluate and compare the level of effort involved in solving particular problems. We investigate the basic assumption of Semantic Web Services that an increased use of formal, declarative semantics will ultimately make it possible to build more flexible solutions, whose adaptation to changes in the underlying problem scenarios requires less effort and programming time than traditional programming approaches.

These goals manifest in the principles of the SWSC Initiative:

  1. We do not presuppose which technologies are best, but rather evaluate them by having them solve common problems and certify the results.
  2. We evaluate both the ability to solve a problem and the programmer effort in responding to a problem change.
  3. We are less interested in program speed than in correctness of program behavior and the degree of programmer productivity.
  4. We are interested in learning trade-offs among technologies and which formalisms are successful in which contexts.
  5. The evaluation results should be simple but useful to people deciding among technologies, especially within industry.

SWSC Methodology

We hope and expect that the results of the SWSC experiments will be replicated by other initiatives and that the current one will become a reference implementation. Toward that end, we abstract here the SWSC methodology as we have developed it by the end of 2007.

Problem Solving

  • There is a set of scenario problems described in English on the initiative wiki. Each scenario has associated with it problem variations, starting with a first problem.
  • Corresponding to each scenario is a set of working web services that SWSC participants can access.
  • SWSC staff maintain these services and evaluate whether a participant has "passed" a scenario problem or sub-problem by examining the log of web service messages exchanged, or by using a mechanical verifier.
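
The log-based check in the last bullet can be sketched as follows. This is only an illustration under stated assumptions, not the verifier used by the SWSC staff: the `Message` fields, the party names, and the operation names are hypothetical, and "passing" is simplified to "the expected messages appear in the log in the expected order".

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Message:
    # Hypothetical log record; a real log would also carry payloads and timestamps.
    sender: str
    receiver: str
    operation: str


def passes_scenario(log: Iterable[Message], expected: Sequence[Message]) -> bool:
    """Return True if the expected messages occur in the log in the given order.

    Unrelated messages in between are tolerated (subsequence check).
    """
    remaining = iter(log)
    # Each any() consumes the shared iterator up to the next match,
    # so later expected messages must appear after earlier ones.
    return all(any(seen == wanted for seen in remaining) for wanted in expected)


# Hypothetical expected exchange for a purchase-order scenario.
expected = [
    Message("customer", "mediator", "submitOrder"),
    Message("mediator", "supplier", "createOrder"),
    Message("mediator", "customer", "confirmOrder"),
]

# Hypothetical log of an actual run, including one extra lookup call.
log = [
    Message("customer", "mediator", "submitOrder"),
    Message("mediator", "supplier", "searchCustomer"),
    Message("mediator", "supplier", "createOrder"),
    Message("mediator", "customer", "confirmOrder"),
]

print(passes_scenario(log, expected))
```

A check like this only looks at the order of operations. Validating message contents would need a scenario-specific comparison on top of it.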

Workshop Agendas

  • Workshops are held to provide consensus verification and evaluation of claims of problem solving.
  • Participants present papers, perhaps having gone through a review process, in which claims are presented.
  • A SWSC staff member verifies whether the claimed problems have actually been solved.
  • The workshop participants, either in teams or as a whole, evaluate claims of ease of response to a problem change by evaluating the actual code.

Publication of Results

Evaluation results are publicly posted on the initiative website and certified by the consensus of the workshop in which they were made.

Evaluation of Code Issues

Code changes:

Initially, we attempted to do this by requiring that each participant submit their code after solving the first problem in each scenario. Only then, after this "code freeze", were they given the passwords to access the descriptions of the problem variations within a scenario. At the workshop, the frozen code and the final code were compared.

We found that this was very difficult to enforce and execute.

In subsequent workshops, we tried giving everyone access to all of the problem variations. However, this obscured how difficult changes were, since everyone could write for all solutions. And this was unfair to those who had participated in the original code freezes.

In Workshop 5, we prepared a "surprise" problem. Participants were evaluated on their solution to this problem the following day. This seems feasible for some of the participants, but as a full-blown test it will require something like a code freeze for Workshop 6, if only a few days before the workshop.

As a standard methodology, we propose that participants submit relevant portions of their code either as an appendix to their papers or, as before, as ftp submissions to a "cold locker". Then, "n" days before the workshop, they are given the passwords to access the problem variations. At the workshop, their success in solving the problem variations is verified and evaluated.
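
One way to make such a "cold locker" submission checkable later is to record a digest of the submitted files at freeze time. The sketch below is an assumption for illustration, not a procedure the SWSC has used: the function name and the file names are hypothetical, and only the standard `hashlib` module is involved.

```python
import hashlib
from typing import Mapping


def freeze_digest(files: Mapping[str, bytes]) -> str:
    """Return one SHA-256 hex digest covering all submitted files.

    Files are processed in sorted name order, so the digest does not
    depend on the order in which they were collected.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        data = files[name]
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")                        # separates the name from the length
        digest.update(len(data).to_bytes(8, "big"))  # delimits the content unambiguously
        digest.update(data)
    return digest.hexdigest()


# Hypothetical submission at freeze time.
frozen = {
    "mediator/mapping.xml": b"<mapping version='1'/>",
    "mediator/process.txt": b"receive order; create order; confirm order",
}
recorded = freeze_digest(frozen)

# At the workshop, the frozen code is re-hashed and compared with the record.
print(freeze_digest(frozen) == recorded)
```

If the organizers publish the recorded digest when the problem variations are released, anyone holding the frozen files can confirm that the baseline was not altered afterwards.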

Keeping the problem variations secret allows new participants to be evaluated on the same basis as previous ones. It also makes it harder for the public to see what the evaluations mean, as they cannot see the actual problem variation details. Nevertheless, this seems so far to be the best method.

Finally, participants should be allowed and encouraged to submit new scenarios that other participants attempt. Authors of scenarios may be evaluated on their own scenarios but certifications will be marked to indicate that they wrote the problem or variation and that it was not a surprise problem for them.

Evaluation Rating

We initially tried to rank the submissions in difficulty of moving from one problem level or sub-level to another by trying to determine whether code was changed that would necessitate a re-compilation and linking, or whether there was only a change to the declaration of objects upon which the code acted. Further, we wanted to distinguish between whether the current declarations had to be altered, or whether new declarations were simply added. We found that these distinctions could not be made objectively. For example, if someone is writing in Lisp, there is no objective difference between declarations and code. XML schemas and Java present similar though less extreme problems.

We have resorted to a collective consensus on simply whether code or declarations have been changed as a measure of difficulty in moving from one level of solution to another. This has been particularly challenging for approaches in which solutions are synthesized by arranging software components in a graph with a GUI. One consideration has been whether changing the graph requires a re-compilation and linking, producing new code, or whether the graph is essentially a declarative input to an engine whose code never changes, only its behavior.

Most recently, we decided just to put check marks in the evaluation table to indicate verification of a correct result, along with footnotes that annotate this result, by consensus. However, if the surprise problem methodology can be made to work, then we would recommend having, say, a plus, instead of a check mark, to indicate that the workshop decided that the problem change was handled with minimal programmer effort.

WSDL, REST, Other Industrially-relevant Specifications and Tools

We do want to deploy standards or specifications that are widely-used in industry in order to be relevant to industry.

We have started with three WSDL Web Services simulating a client trying to purchase goods using the RosettaNet protocol. Counting the different versions of the services and the mediation systems that have been implemented to test the system, we are at present operating around 20 different Web Services. These web services run on standard middleware such as Axis2 and Tomcat.

Unfortunately, the complexity of the messages used has revealed several bugs in the implementation of the Axis2 engine, so major resources were spent on the underlying technologies rather than on the 'business' problem.

In fact, it turns out that a variety of skills is required to master such a testbed. First, in-depth knowledge of WSDL and XML Schema is needed to design proper service descriptions that use the full descriptive power of the standards. Knowledge of a web service engine (such as Axis2) and the underlying application server (such as Tomcat) is obviously required as well, along with a fair amount of database design and web application programming skills. It also turned out to be necessary to understand a good deal about the Internet Protocol and firewalls in order to help participants manage their invocations. And, last but not least, such an infrastructure requires monitoring facilities that guarantee a 24/7 live system, which is not the usual approach in a university or research environment.

Effectively it demonstrated that in spite of the fact that Web Services are an established technology, current tools are only able to hide a small degree of the underlying complexity. As soon as we reached some border case, understanding of underlying protocols and standards was essential.

We will consider adding REST services to the scenarios because they seem to be used increasingly, and because few middleware layers exist so far for this type of service. It remains to be seen whether this will be true in the future.

However, one of the issues that has caused the most trouble is the large RosettaNet messages. We maintain that this is an industry standard that has to be handled, and if it is causing a problem, then our SWSC has been successful in raising this problem. We will certainly consider less difficult scenarios, but the scenarios with these large messages remain an important part of the challenge, and we recommend that other challenges also contain such problems.

Logical Problem Formalisms

Besides the technical challenge, we realized another important point. We decided not to formalize the problems using a logical formalism, but rather to describe them using natural language documentation. Having had to communicate with developers as well as participants, we conclude that having only text-based documentation as a common model is suboptimal. We realized that a fair amount of the solution to a problem lies in its formal description. In fact, had we had such descriptions from the start, we could have saved several iterations of discussion with developers.

However, it is not at all clear whether any one formalism for expressing the problem can be chosen that would be fair to all participants and perhaps would not promote a particular solution. For now, we recommend that the problem descriptions be in English, which also promotes new problem definitions.

Collaboration Infrastructure

Having effective means to share information between the organizers and the participants is another important aspect of a successful challenge. We started with a set of static web pages, but it was soon clear that this was suboptimal. A Wiki that enables collaborative corrections and improvements to the documentation turned out to be much more adequate. While this improved the efficiency of the discussions around the different problem sets, it turned out not to be enough for sharing descriptions of the solutions between participants.

Like the problems, the solutions come with a fair amount of complexity. In order for a team to participate, we required it to publish the declarative parts of its solution on the Semantic Web Challenge Portal. A Wiki did not provide sufficient means to share such complex structures, so in addition we created FTP accounts. However, this turned out to be suboptimal: while it made it possible to understand and verify a particular solution, the links between a solution's description in the papers submitted to the workshops, the related discussion on the Wiki, and the relevant parts of the solution's declarative description are poorly integrated. We assume that this is one of the reasons why participants have so far shared only a very limited amount of their formalizations.

A best solution for this issue has yet to be found.


Evaluation and Debug Infrastructure

Another aspect of involving real Web services is the possibility of automatically verifying a solution by issuing a set of different messages and monitoring the subsequent message exchanges. This is a useful feature, since it makes the challenge more scalable with respect to the number of participants. Moreover, it allows teams to participate not only during workshops but also at any other time, simply by exposing their Web Services. Other people interested in the claims of a team can use the online portal to start a test set against a particular solution and verify its coverage.

Another aspect is to offer some form of debugging support. Already with six teams it was quite often necessary to examine the application server's log, be it to find a typo in the endpoint addresses used in a mediator implementation or to identify an invalid message. Over time we added different views to the online portal that make it possible to examine parts of the message exchange and, in particular, the status of the systems involved. It is clear that the users must, at a minimum, be provided with log access.

Discussion Item 1.1: Input from People not involved in the SWSC

Several people suggested asking members of the Incubator Group and the SWSC PC who are not involved in the SWS-Challenge why this is the case. What are your opinions on what is missing, what is wrong or not optimal, etc.? What are your reasons for not participating in the Challenge even though you are obviously interested in SWS testbeds/evaluation? Input pending.

Discussion Item 1.2: Other future goals of the SWSC

Uli: Additionally I would like to see more scenarios and more explicit statements for each scenario that tell you which aspects are addressed by and in the focus of this scenario. Ultimately there should be some matrix that tells you which problems are addressed/evaluated by which scenario. I don't think that a clear and complete understanding of the problem space to be addressed by the scenarios has yet been developed.
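
The matrix suggested here could be kept as a simple mapping from scenarios to the aspects they address. The sketch below only illustrates the data structure: the aspect list and the scenario-to-aspect assignments are placeholders, not an agreed description of the SWSC problem space.

```python
from typing import Dict, List, Set


def coverage_matrix(scenarios: Dict[str, Set[str]], aspects: List[str]) -> List[List[str]]:
    """Build a table with one row per scenario and one column per aspect."""
    header = ["scenario"] + aspects
    rows = [
        [name] + ["x" if aspect in covered else "-" for aspect in aspects]
        for name, covered in scenarios.items()
    ]
    return [header] + rows


def unaddressed(scenarios: Dict[str, Set[str]], aspects: List[str]) -> List[str]:
    """List the aspects that no scenario addresses yet."""
    covered = set().union(*scenarios.values())
    return [aspect for aspect in aspects if aspect not in covered]


# Placeholder problem space and assignments, for illustration only.
aspects = ["data mediation", "process mediation", "discovery", "composition"]
scenarios = {
    "Mediation": {"data mediation", "process mediation"},
    "Discovery": {"discovery"},
}

for row in coverage_matrix(scenarios, aspects):
    print("\t".join(row))
print("not yet addressed:", unaddressed(scenarios, aspects))
```

Listing the unaddressed aspects makes gaps in the scenario set visible, which is the kind of overview the comment asks for.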

Discussion Item 1.3: Distinguishing Roles of Developer and Modeler

Tiziana: Given the focus on aiming for declarativity in our workshops so far, I would suggest that we distinguish the roles of a designer/modeller and a programmer. If a group is gathering knowledge in some model or formalism, and this knowledge is then used by a system that does the programming (by synthesis, by world wide wizard, whatever), the ratio between the first part, which I call design/modelling, and what remains of solution coding by hand is a meaningful measure. Even if qualitative.

Discussion Item 1.4: Measuring solution flexibility by involved programmer effort

General methodology

The proposed methodology to measure flexibility and ultimately programmer effort is to change problem scenarios and measure the amount of change required to adapt an existing solution to the changed problem.

It was discussed whether the solution for a problem variation, which was built by adapting a solution to the original problem, is required to still solve the original scenario. It was agreed that this is not a general requirement for all problems and problem variations. However, for some specifically designed problems this question could be an important evaluation criterion. It is assumed that more declarative solutions will be more likely to work with different problem variations in parallel. The amount of effort involved in creating solutions that have this capability is thus suggested as one evaluation criterion.

Preventing anticipating changes

The general issue with the presented methodology is that if all variations of a scenario are known beforehand, solutions to the base problem can obviously be designed in a way that minimizes the effort required for adaptation.

For the first evaluation at the Budva Workshop, this issue was addressed by keeping the problem variation (Mediation Scenario 2) secret until a code freeze of the solutions to the base problem (Mediation Scenario 1) was performed. The general view was that such a code freeze involves too much effort on the side of the organizers as well as the participants and does not scale. No further code freezes were therefore implemented after the Budva workshop. Furthermore, both variations of the Mediation Scenario were unfortunately made public after the Budva workshop. Thus assessments of the effort involved in solving Version 2 of the Mediation Scenario on top of Version 1 could not be made objectively and were finally dropped at the second Stanford workshop.

Since a code freeze is considered to involve too much effort, a new approach is currently being tested. A base problem is public and can be solved beforehand, but a problem variation is kept secret. During or shortly before the workshops, variations of the public scenarios are revealed to the participants, who need to adapt their previous solutions within a short time. People who submit surprise scenarios can be evaluated on those scenarios, but it will be made clear that the scenario was not a surprise for them. The extent to which they can be evaluated depends on the concrete setup of the scenario. The allotted time for performing the changes, the complexity of the base problem, and the complexity of the changes will be chosen so that participants need to solve the base problem beforehand and will only be able to implement the necessary changes after the problem variation is released. This way, a code freeze is not necessary, yet solutions cannot be implemented in the first place (or changed afterwards) in a way that eases the adaptation to the problem scenario changes once these are known.

At the second Stanford Workshop (November 2007), a problem scenario was revealed to the participants during the workshop. This scenario was related to, but did not really build upon, previous scenarios. On the second day of the workshop, small changes to the scenario had to be implemented. However, there were only two running solutions to the surprise scenario, and the required changes were too trivial to really measure a solution's flexibility. It is planned to elaborate on the surprise problem variation methodology at the next workshops.

Measures for the amount of change

At the Budva Workshop the evaluation of the amount of change was performed according to the following levels:

  • Success Level 0 was assigned if the changed problem was correctly solved.
  • Success Level 1 was assigned when executable code had to be changed (the compiler or interpreter executed different instructions.)
  • Success Level 2 was assigned if only data had to be changed: no execution code had to be changed.

The assessment of the level was done by performing a code review of the solutions and discussing the results in the workshop until a consensus was reached.

The assessment of the success level proved to be extremely difficult. Over the following workshops we came to the conclusion that a distinction between code and data could not be made objectively. The success level scale also did not really seem to be a good measure of the programmer effort involved, since an adaptation requiring a small change in code was deemed less successful than one that required substantial changes in data. Finally, the assessment of the success level was completely dropped at the second Stanford workshop.

It was also suggested to evaluate whether any existing code had to be changed or whether only new code was added. It was concluded that the distinction between "new code added" and "existing code changed" is as fuzzy as between compiled and not compiled code or between data and code and that we really do need a measure for programmer productivity/effort instead.

It was also suggested to objectively measure the lines of code changed. This, however, does not work when hand-crafted code is compared with generated code. When talking about changing code, we are interested in what possibly very high-level code a person changed, not in what a tool generated as a result. This becomes somewhat problematic in cases where graphical programming tools are used and "code changing" means moving icons around in a graph in a GUI. That is why we objected to having a script count the changes and assign a mark, and instead looked at the code and tried to come to an expert agreement about the amount of effort/change involved.
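
To make the objection concrete, the following sketch counts changed lines with Python's standard `difflib` module. It illustrates the kind of script that was rejected and is not a tool used in the Challenge. The "model" and "generated" snippets are invented for illustration.

```python
import difflib


def changed_line_count(before: str, after: str) -> int:
    """Count removed plus added lines between two versions of a text."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff marks removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are ignored.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


# Invented example: a one-line change in a hand-written, high-level model ...
model_before = "ship_to = billing_address"
model_after = "ship_to = delivery_address"

# ... and the code a generator might emit from each version of that model.
generated_before = "\n".join(f"addr.set_field_{i}(billing.field_{i})" for i in range(5))
generated_after = "\n".join(f"addr.set_field_{i}(delivery.field_{i})" for i in range(5))

print(changed_line_count(model_before, model_after))
print(changed_line_count(generated_before, generated_after))
```

The same edit by the programmer yields very different counts depending on which artifact is measured, so a raw line count says little about the effort involved.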

Overall it was concluded that we need to continue to use such a group consensus on how much has changed in adapting a solution. However, we still need to agree upon which metrics really to use for this. It was also remarked that evaluations done by different groups at different workshops become very difficult to compare. The extraction of more objective criteria that can bring more objectivity in the group judgement is planned for the future.

Discussion Item 1.5: Sharing of solutions

One of the main goals of the Challenge is to improve the mutual understanding of each other's technologies. For this task, sharing solutions and providing technical documentation of solutions is absolutely crucial.

Furthermore, an evaluation made by group consensus is not fully objective. Currently, however, we have no better means of evaluation than the group consensus at the workshops. This group consensus would become much more transparent and reliable if solutions were documented in a way that lets independent people later see and understand why the group came to a certain consensus. Without documented solutions (also at a technical level) this is impossible. If the certified solutions are not shared, the evaluation is not only subjective but, even worse, non-transparent and irreproducible.

I therefore believe that we need to make our requirements for sharing and documenting solutions explicit, and I propose the following:

Making people share solutions