
We provide only abridged tables here; the complete tables are available in Appendix D. Although it is not shown in the example above, mixed content elements are also supported. The string escaping is unambiguous, so an escaped value can always be reconstructed to its original state.

Implementation

The second generated table, the table of paths, is usually smaller, since one path can address more than one XML node.

It was necessary to use another technique for working with larger XML files, so a different transformation that uses a SAX parser was implemented. To run the transformation, our application requires two parameters.

The first is a path to the XML file on the local disk. This path is also used to create the directory where the file containing the resulting table will be stored. The second parameter sets the mode in which the application should run; there are three running strategies. Setting the parameter to 0 activates a transformation that uses the SAX parser. This mode is recommended for larger XML documents. It generates a table with the structure Edge(deweyId, type, value), where deweyId represents a dewey path from the root node of the XML document to a single node.

In the remainder of this work we use the term dewey path, which better expresses the meaning of deweyId. The next two modes use DOM, so the processed document must fit into memory. If the parameter is set to 1, the structure of the generated rows is the same as in the previous mode.

On the other hand, if the second parameter is set to anything other than 0 and 1, rows are stored as Edge(deweyId, type, value, pathId), and additionally a second text file containing paths to the specific nodes, with the structure Path(pathId, path), is created. We decided to keep the DOM-based transformation because with DOM we had implemented the transformation of an XML document into a relational table according to the original paper [24] that we followed.
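As a rough sketch of the SAX-based mode (option 0), the handler below streams an XML document and emits Edge(deweyId, type, value) rows. The concrete dewey numbering (root "0", dot-separated child indices), the pipe-delimited row format, and the type labels elem and text are illustrative assumptions, not the exact encoding used by our application.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: stream an XML document with SAX and emit Edge(deweyId, type, value)
// rows. Dewey scheme and row format are illustrative assumptions.
class EdgeTransformer extends DefaultHandler {
    private final List<String> rows = new ArrayList<>();
    private final List<Integer> counters = new ArrayList<>(); // child counter per open element
    private String currentPath = "0";                         // dewey path of the open element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (!counters.isEmpty()) {
            int idx = counters.get(counters.size() - 1);       // next child index of parent
            counters.set(counters.size() - 1, idx + 1);
            currentPath = currentPath + "." + idx;
        }
        rows.add(currentPath + "|elem|" + qName);
        counters.add(0);                                       // counter for this element's children
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            int idx = counters.get(counters.size() - 1);
            counters.set(counters.size() - 1, idx + 1);
            rows.add(currentPath + "." + idx + "|text|" + text);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        counters.remove(counters.size() - 1);
        int cut = currentPath.lastIndexOf('.');
        if (cut > 0) currentPath = currentPath.substring(0, cut);
    }

    public static List<String> transform(String xml) {
        try {
            EdgeTransformer h = new EdgeTransformer();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), h);
            return h.rows;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because the handler only keeps the current path and one counter per open element, memory use is bounded by the document depth, which is why a SAX mode suits large documents.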

The resulting file sizes are shown in the following Table 5. The sizes of the files do not include the size of the table of paths. For testing purposes the XML documents were transformed with option 0, and the smaller files also with option 1.

We will use the Edge table generated by the previously mentioned transformation application with option 0 or 1. This means that the pathId column is not important for our processing, since all the necessary information is included in the dewey path; pathIds can only help with the processing of the child axis.

We chose the dewey order encoding due to its sufficient performance in both node selection and updating; however, in this work only node selection is implemented.

Although the transformation application supports mixed content nodes, this feature is not supported by the XPath executor.

In this chapter we describe the implementation of a driver program for Spark that is used to accomplish the goals of this thesis. We decided to implement the driver program in Java. In this thesis we focus on XPath axes. Below we describe the ideas that form the basis for the translation of XPath axes into SQL queries; these ideas are applied in our driver program. The translation of the particular axes is based on the comparison of dewey paths.

We use the term context node for the set of resultant nodes after an XPath step evaluation; all the comparisons below are viewed from the context node. At the beginning of the translation of every XPath query the context node is the document root; afterwards, the context node is the result of the most recently executed XPath step. In the following list the approaches to the various XPath axes are described. The dewey path of a child of some node is lexicographically greater than the path of that node and contains exactly one more path part. The difference for the descendant axis is that descendant nodes may have more than one extra path part.

Hence the nodes whose dewey path is lexicographically greater than, or equal to, that of the context node are taken. For the following-sibling axis it is important to select nodes with the same path length as the context node whose dewey paths differ only in their last part, which also means that they have the same parent. The common attribute of the paths of nodes on the following axis is that they do not contain the dewey path of the context node as a prefix.

They are not prefixes of any path of the context node, and additionally the paths of the context node are not prefixes of the paths of the desired nodes. It should be noted that XPath allows filtering based on the node name; in this case only the nodes whose dewey path equals one of the previously chosen nodes, the context node, are extracted.

The relation between a node and its parent node is the same as in the child axis, but with inverted meaning: the parent node's path is a prefix of the context node's path. If a concrete name is set in the node test part of the desired axis, the dataset is first filtered by the name and then the filtration through the axis is applied.
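The comparisons above can be sketched as plain predicates over dot-separated dewey paths. This is an illustrative reconstruction, not our exact user-defined functions, and it assumes path parts whose lexicographic order agrees with document order (for instance fixed-width parts); with variable-width parts a plain string comparison of "0.10" and "0.2" would be wrong.

```java
// Sketch of the dewey-path predicates behind the axis translations.
// Paths are dot-separated part strings such as "0.1.2" (an assumption).
class DeweyAxes {

    // a is a proper prefix of b on a part boundary
    static boolean isPrefix(String a, String b) {
        return b.startsWith(a + ".");
    }

    // b is a child of a: a is a prefix and b has exactly one extra part
    static boolean isChild(String a, String b) {
        return isPrefix(a, b) && b.substring(a.length() + 1).indexOf('.') < 0;
    }

    // b is a descendant of a: a is a prefix, any number of extra parts
    static boolean isDescendant(String a, String b) {
        return isPrefix(a, b);
    }

    // b is a following sibling of a: same parent, same number of parts,
    // and b is lexicographically greater
    static boolean isFollowingSibling(String a, String b) {
        int ca = a.lastIndexOf('.'), cb = b.lastIndexOf('.');
        return ca >= 0 && cb >= 0
                && a.substring(0, ca).equals(b.substring(0, cb))
                && a.compareTo(b) < 0;
    }

    // b lies on the following axis of a: lexicographically after a
    // and not a descendant of a (ancestors sort before a, so they drop out)
    static boolean isFollowing(String a, String b) {
        return a.compareTo(b) < 0 && !isPrefix(a, b);
    }

    // b is the parent of a: the child relation with the roles swapped
    static boolean isParent(String a, String b) {
        return isChild(b, a);
    }
}
```

The parent predicate really is just isChild with swapped arguments, which mirrors the observation below that one user-defined function can serve both directions.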

If we want to see the advantages of Spark, we have to run our driver program on a cluster in cluster mode. This means that the file should be accessible to the Spark driver and all worker nodes. Once the file is accessible to the members of the Spark cluster, it may be read. Each row is read and split into a Node object, which is then passed to an RDD.

A DataFrame may be created directly from an RDD, but it is necessary to define a schema for the table. We created a class Node, and via reflection the schema defined through the Node class was applied to the RDD. Support for the abbreviated forms of some XPath steps was added to the parser. The whole query is split into separate XPath steps and the abbreviated forms are resolved.

Then all the steps are evaluated one by one according to the desired axis. The step-by-step evaluation is implemented by an indirectly recursive algorithm, meaning that every next step depends on the result of the previously evaluated step. The parser works only with the axes that were mentioned in Chapter 5. If the conditions are fulfilled, the true value is returned. By swapping the order of its parameters, the user-defined function isChild may also be used for checking whether a node is the parent node of some other node.
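A minimal sketch of splitting a query into steps and resolving abbreviations is shown below. The rewriting rules are the standard XPath ones ("//" for descendant-or-self, "." for self, ".." for parent, child as the default axis); our parser's actual interface is not reproduced here, and predicates inside steps are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split an XPath query into steps and expand the usual abbreviations.
class StepSplitter {
    static List<String> split(String query) {
        // "//" abbreviates "/descendant-or-self::node()/"
        String expanded = query.replace("//", "/descendant-or-self::node()/");
        List<String> steps = new ArrayList<>();
        for (String step : expanded.split("/")) {
            if (step.isEmpty()) continue;                 // the leading "/" yields an empty token
            if (step.equals(".")) step = "self::node()";
            else if (step.equals("..")) step = "parent::node()";
            else if (!step.contains("::")) step = "child::" + step;  // default axis
            steps.add(step);
        }
        return steps;
    }
}
```

For example, "//a/b" expands to the three steps descendant-or-self::node(), child::a, child::b, which are then evaluated one after another by the recursive algorithm.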

The true value is returned if the first argument is lexicographically less than or equal to the second argument and the first is included within the second.

We can simply check whether the first path is a prefix of the second, because every dewey path starts with a part containing one or more zeros; zeros are always located in the first part of a dewey path and never at another position.

It also means that the two parameters belong under the same parent. Note that the user-defined functions implemented in the broadcast lookup collection method extend the main idea of the functions mentioned above.

The strategies are characterized in more detail in Chapter 4. The first input parameter of the driver program is a path to the text file containing a transformed XML document, the second parameter expects the XPath query that will be evaluated, and the third parameter sets the running mode. Admittedly, a newly created document should be constructed in the document order of the original XML document. If we want to build an XML document in a parallel way, this is hardly possible and impractical.

Even though each executor has its own partitions and could theoretically build XML fragments from them, the final XML document cannot be built because the executors do not have information about the nesting of XML elements. Hence we decided to store results in a way that does not require collecting data to the driver.

Accordingly, the result is not stored as a single XML file, but as a table including the schema of the stored data. Spark SQL also supports writing data to Apache Hive, a data warehouse software which facilitates querying and managing large datasets residing in distributed storage [17]. The same idea was applied to the storing of the result.

XPath executor

First the transformed file is read. By default the result is not saved into a single file, but into a set of several files according to the number of partitions.

A single partition may be set explicitly, and then the result will be stored in one file. During the storing process the method toString is called on each element of the DataFrame and the elements are stored one per line. After executing all transformations and actions, the final result is an ordered set of XML nodes. This set can contain duplicates. Normally, XPath processors that evaluate XPath queries always return a set of unique nodes, and the returned nodes can contain other nested nodes.

When evaluating queries that use, for example, the preceding or following axis, the returned nodes can also be nested in other returned nodes. The reason why we have duplicates is that we work with a set of self-contained nodes without any nested nodes. Even though we have duplicates, they would be stored on different levels in a potential resulting XML document.

Finally, we decided to store the result as a JSON file. Even though it is a specially formed JSON, it can still be processed by other technologies that do not support the other output formats available in Spark.

We implemented methods that transform the nodes back into an XML file while keeping the document order of the original XML. The first method uses a DOM representation and, as we mentioned in Chapter 5. , the processed document must fit into memory, so with larger results it fails. This method takes the first node from the result and creates a new element in the DOM. Then all its child nodes are selected and assigned to their parent. If a node has already been assigned, it is deleted from the set of resultant nodes. These steps are repeated for each child until there is no other child node or the set is empty. If there are still nodes in the resultant set, the algorithm continues and creates a following sibling for the first created element.

Since it is easy to rewrite the DOM method into a sequential, manual creation of XML elements, we also provide an algorithm for processing larger results.

In the manual creation the nodes are read sequentially and each processed node is directly written into the file. Neither method requires the resultant nodes to be in document order, but the resultant set must be sorted. Note that nodes in document order and sorted nodes are not the same thing, since the resultant set can contain duplicates.
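The sequential creation can be sketched as a single pass over the sorted nodes with a stack of open elements. This is a simplified reconstruction: Node is reduced to a (deweyPath, name) pair, and values, text nodes, and duplicate handling are omitted for brevity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch of the sequential ("manual") XML reconstruction: resultant nodes
// sorted by dewey path are streamed once; a stack tracks the open elements.
class XmlBuilder {
    static class Node {
        final String path, name;
        Node(String path, String name) { this.path = path; this.name = name; }
    }

    static String build(List<Node> sorted) {
        StringBuilder out = new StringBuilder();
        Deque<Node> open = new ArrayDeque<>();
        for (Node n : sorted) {
            // close every open element that is not an ancestor of the current node
            while (!open.isEmpty() && !n.path.startsWith(open.peek().path + ".")) {
                out.append("</").append(open.pop().name).append('>');
            }
            out.append('<').append(n.name).append('>');
            open.push(n);
        }
        while (!open.isEmpty()) out.append("</").append(open.pop().name).append('>');
        return out.toString();
    }
}
```

Because only the stack of currently open ancestors is kept in memory, this variant scales to results that would not fit into a DOM.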

Spark has various configuration options to make an application run faster. In the following sections three methods that could be helpful are described. Too much network communication has a negative impact on performance [25]. When working with two DataFrames where one of them is smaller than the other and fits into memory, a broadcast variable can be useful to avoid the shuffle operation.
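The effect of broadcasting can be illustrated outside of Spark: the small table is shipped to every executor as an in-memory lookup map, so the join becomes a local probe per row instead of a shuffle. The plain-Java sketch below shows this map-side join idea; it is an illustration of the concept, not our actual Spark code, and the two-column string-array rows are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Map-side join sketch: the small table becomes a lookup map (what a Spark
// broadcast variable provides on each executor), so joining the large table
// is a local probe per row and no data is shuffled.
class BroadcastJoin {
    static List<String> join(List<String[]> large, List<String[]> small) {
        Map<String, String> lookup = new HashMap<>();
        for (String[] row : small) lookup.put(row[0], row[1]);   // key -> value
        List<String> out = new ArrayList<>();
        for (String[] row : large) {
            String match = lookup.get(row[0]);
            if (match != null) out.add(row[0] + "|" + row[1] + "|" + match);
        }
        return out;
    }
}
```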

It is preferable to broadcast the smaller DataFrame and use it as an immutable lookup table. We tested broadcast variables on a simple join of two tables where one was more than a hundred times smaller; the results are shown in the following Table 5. Another advantage is that the total number of partitions is configurable. Having numerous executors and working with just one partition makes no sense. By default, Spark splits data into at least the number of partitions corresponding to the number of cores available on all executor nodes, whereby the default maximal size of a partition is MB in cluster mode and 32 MB in local mode.

Both very large and very small partitions might have a bad impact on performance. If the required number of partitions is set to a value bigger than the number of read items, empty partitions are created, so that is not a good idea either; the number of partitions should be balanced, as shown in Table 5. Some transformations, such as join, can automatically increase the number of partitions when needed. Hence Spark provides a coalesce method that decreases the number of partitions and does not shuffle data over the network.

We tested various partition sizes on our biggest table using an early version of the broadcasted lookup table method, which used a join as the last step of the algorithm. We ran one more test in which the rows were only filtered and modified; these operations did not require transferring data to other executors, since each executor worked only with its own partitions.

The results of this test showed that in this case it was better to keep the default partitioning. When several actions are called on the same collection, by default it is recomputed repeatedly.

From a time point of view it can be costly to repeat the same computation several times. Spark provides the possibility to persist a collection for further computations and allows several storage levels, such as memory, disk, or memory and disk (if the data do not fit into memory, they are persisted to disk), in serialized or non-serialized form.

Persisting is not invoked by calling the applicable method alone; it is invoked after the first action that is called on the collection. We provide a Table 5. with the measured values. We were testing on 2. To show the difference, we evaluated a longer XPath query. As we can see from our measurements, when actions are invoked frequently, caching has a really positive impact on performance. The biggest changes were realized between versions 1.

In the following list, the changes that had the biggest effect on our application are mentioned. We developed our driver program in Java; after some time we realized that this probably was not the best option. Working with a cluster in Java. From this moment on, we were able to run our driver program in cluster mode and see all the advantages of parallel computing. At the beginning of this chapter we showed Figure 5.

By using a cluster, the cooperation is a bit different, since the text file to be processed and the driver program must be available to all members of the cluster (Figure 5. ). With a small input file it is not possible to see the benefits of computation on a cluster, since in some cases the communication overhead can take more time than the computation. The computation on the cluster forced us to use the Hadoop Distributed File System to make our text files visible to the workers.

On the cluster we continued our test attempts with the bigger files, since a sufficient amount of memory was available among the worker nodes. The parameters of the local testing machine that was used for the experiments were described at the beginning of Chapter 4. The available cluster on which the experiments were run consists of 4 virtual machines hosted on four Intel Xeon 3. processors. The version of Spark used for the experiments was 1. The comparison of computation in local mode and in cluster mode brought the expected results.

Admittedly, computation on the cluster with cluster mode enabled was faster only in some cases. In these experiments the fastest method, the one using a nested lookup collection, was used. The slowdown in the other cases can be caused by cluster overhead expenses such as serialization and transporting data among the workers, as the graph in Figure 5. illustrates.

Summary

The first application prepares XML documents into a form that can be processed by the second application.

In this chapter the performance of the individual methods was compared. The final interpretation of the performance testing is in Chapter 6.

Our tests are focused mainly on the functionality of our solution. The results of the performance testing of the implemented methods are also presented in this chapter. For the functionality testing we decided on manual and unit testing. During the implementation of the driver program we were working with a couple of XML documents, and we created numerous XPath queries that cover most query cases.

By hand we tested each step of each XPath query, focusing our attention on the count of returned nodes. We used BaseX in version 8. and compared the count of nodes returned by our application with the count returned by BaseX. Comparing counts alone was not sufficient, since we did not know whether the correct nodes were returned. We realized that automatic tests that compare the results of complex XPath queries were needed.

Although we have separate functions for the translation of each axis, it is a bit complicated to test them, since the called transformations are evaluated lazily.

Testing and Experiments

Smaller units should be tested before they are integrated into the bigger units.

If the smaller units work as expected, most probably the bigger units will also work. We follow this idea while dealing with lazy evaluation: first we wrote tests just for the functions that evaluate the child and descendant axes, so that if these tests pass we can use those axes in the testing queries for the other axes, such as parent, preceding, or following, which cannot be tested from a document node.

Note that the document node is an alternative to doc "xmlFile. Then we created a set of XPath queries, different from the manual testing ones, that contains two XPath queries for each axis.

The same XPath queries are used in the tests for all the implemented methods. We again used BaseX to evaluate all the created XPath queries. Each result of evaluation was then transformed into a list of nodes via our XML transformer and stored in the test environment. Thus we had the correct result for each XPath query that we would test. Finally, we could evaluate the queries in the XPath executor and compare their results with the results from BaseX.

We created eleven test cases, one for each axis; in each test two queries are tested, as mentioned above. It is not necessary to test the document order of the resultant nodes, since the algorithms for the XML file creation presented in Chapter 5. do not require it. During the implementation, the unit tests helped with finding bugs.

All bugs were solved and all the tests have passed.

Experiments

We collected all the measured values into one summarizing Table 6. All the measurements in the table were realized locally. The measuring of the value marked with was interrupted due to its slowness. The measured values are projected on a graph in Figure 6. ; the numbers in the table header are the counts of processed axes. It can be seen that the method using the Cartesian product is really slow.

There is one more interesting thing in the graph: we can see the positive impact of caching, in accordance with Table 6. Since our methods evaluate an XPath query step by step, the impact of the number of evaluated steps can be seen in the graph. More XPath steps mean a longer computation, since no optimization is used and all steps are evaluated. All the prepared tests passed, so we can suppose that the XPath queries are evaluated correctly. The performance of our methods was also tested, and we found that the creation of the new methods led to increased computational speed.

This can be seen in the number of new releases that were published relatively often during the work on this thesis. For this purpose two applications were designed and developed. The first takes an XML document and transforms it into tabular form so that it can be used and processed by the second application. The second application is a driver program for Spark.

In this thesis, five different methods of our approach were introduced, and a performance comparison of the individual methods was presented. However, only three methods are implemented in the final driver program. This is because the method using the Cartesian product was extremely slow, and since it was determined that executing the same query via SQL and via the DataFrame API generates the same physical plan, only the SQL method was implemented.

While working locally with really small data, the method using the Cartesian product was also relatively fast, but working with the cluster showed that for processing bigger data the trivial method had to be improved. The subset of the XPath query language supported by the implemented methods contains all XPath axes except the attribute and namespace axes.

The functionality of the methods was tested in two independent ways, manually and by unit tests. Several tests were created for the single axes. All tests passed successfully for the implemented methods, so the main goal of the thesis was fulfilled. Sometimes it was not easy to understand why an application on the cluster fell down, whereas in local mode the driver program was executed successfully.
