Friday, July 4, 2008
From data warehousing to data mining?
Data Warehouse Usage
1) Three kinds of data warehouse applications
i) Information processing
a) supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
ii) Analytical processing
a) multidimensional analysis of data warehouse data
b) supports basic OLAP operations, slice-dice, drilling, pivoting
iii) Data mining
a) knowledge discovery from hidden patterns
b) supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Further development of data cube technology
Discovery-Driven Exploration of Data Cubes
1) Hypothesis-driven: exploration by user, huge search space
2) Discovery-driven
a) pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation
b) Exception: significantly different from the value anticipated, based on a statistical model
c) Visual cues such as background color are used to reflect the degree of exception of each cell
d) Computation of exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction
Complex Aggregation at Multiple Granularities: Multi-Feature Cubes
1) Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum price tuples
select item, region, month, max(price), sum(R.sales)
from purchases
where year = 1997
cube by item, region, month: R
such that R.price = max(price)
Data warehouse architecture?
Data Warehouse Design Process
1) Top-down, bottom-up approaches or a combination of both
a) Top-down: Starts with overall design and planning (mature)
b) Bottom-up: Starts with experiments and prototypes (rapid)
2) From software engineering point of view
a) Waterfall: structured and systematic analysis at each step before proceeding to the next
b) Spiral: rapid generation of increasingly functional systems, with short turnaround times
3) Typical data warehouse design process
a) Choose a business process to model, e.g., orders, invoices, etc.
b) Choose the grain (atomic level of data) of the business process
c) Choose the dimensions that will apply to each fact table record
d) Choose the measure that will populate each fact table record
Data warehouse implementation
Efficient Data Cube Computation
1) Data cube can be viewed as a lattice of cuboids
a) The bottom-most cuboid is the base cuboid
b) The top-most cuboid (apex) contains only one cell
c) How many cuboids are there in an n-dimensional cube with L levels? (See the formula after this list.)
2) Materialization of data cube
a) Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
b) Selection of which cuboids to materialize
c) Based on size, sharing, access frequency, etc.
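The cuboid-count question above has a standard closed form. If dimension i carries a concept hierarchy with L_i levels (not counting the virtual top level "all"), the total number of cuboids is, in LaTeX notation:

T = \prod_{i=1}^{n} (L_i + 1)

With no hierarchies (L_i = 1 for every dimension) this reduces to 2^n. For example, a 10-dimensional cube with 4 levels per dimension yields 5^10, roughly 9.8 million cuboids, which is why partial materialization matters in practice.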
Indexing OLAP Data: Bitmap Index
1) Index on a particular column
2) Each value in the column has a bit vector: bit-op is fast
3) The length of the bit vector: # of records in the base table
4) The i-th bit is set if the i-th row of the base table has the value for the indexed column
5) Not suitable for high cardinality domains
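As a minimal sketch of how such an index is declared, here is Oracle-style SQL (the bitmap index syntax is Oracle-specific, and the table and column names are hypothetical):

-- One bit vector per distinct region value; one bit per row of sales.
CREATE BITMAP INDEX sales_region_bix ON sales (region);

A predicate such as WHERE region = 'Asia' is then answered by scanning one bit vector, and conjunctions of predicates on several bitmap-indexed columns reduce to fast bitwise AND operations.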
Indexing OLAP Data: Join Indices
1) Traditional indices map the values to a list of record ids
a) A join index materializes the relational join in a JI file and speeds up the relational join, a rather costly operation
2) In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
i) E.g. fact table: Sales and two dimensions city and product
a) A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
b) Join indices can span multiple dimensions
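Oracle exposes this idea as a bitmap join index; a hedged sketch, with hypothetical table and column names:

-- Precompute the join between the Sales fact table and the city
-- dimension: each city name maps directly to the matching fact rows.
CREATE BITMAP INDEX sales_city_bjix
    ON sales (city.city_name)
    FROM sales, city
    WHERE sales.city_id = city.city_id;

Queries that filter sales by city name can then locate the relevant fact-table rows without performing the join at query time.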
Efficient Processing of OLAP Queries
1) Determine which operations should be performed on the available cuboids:
a) Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
2) Determine to which materialized cuboid(s) the relevant operations should be applied.
Metadata Repository
1) Metadata is the data defining warehouse objects. It includes the following kinds:
i) Description of the structure of the warehouse
a) schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
ii) Operational meta-data
a) data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
iii) The algorithms used for summarization
iv) The mapping from operational environment to the data warehouse
v) Data related to system performance
a) Warehouse schema, view and derived data definitions
vi) Business data
a) business terms and definitions, ownership of data, charging policies
Data Warehouse Back-End Tools and Utilities
1) Data extraction:
a) get data from multiple, heterogeneous, and external sources
2) Data cleaning:
a) detect errors in the data and rectify them when possible
3) Data transformation:
a) convert data from legacy or host format to warehouse format
4) Load:
a) sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
5) Refresh:
a) propagate the updates from the data sources to the warehouse
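A minimal sketch of the transformation-and-load steps in standard SQL, assuming a hypothetical staging table and warehouse schema:

-- Convert, consolidate, and summarize staged data into the fact table.
-- All table and column names are illustrative.
INSERT INTO sales_fact (item_key, time_key, dollars_sold)
SELECT i.item_key,
       t.time_key,
       SUM(COALESCE(s.amount, 0))                                -- rectify missing amounts
FROM   staging_sales s
JOIN   item_dim i ON i.item_code = UPPER(TRIM(s.item_code))      -- normalize legacy codes
JOIN   time_dim t ON t.cal_date = s.sale_date
GROUP  BY i.item_key, t.time_key;                                -- summarize before loading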
Data Warehousing and OLAP Technology for Data Mining?
1) What is a data warehouse?
a) Defined in many different ways, but not rigorously.
b) A decision support database that is maintained separately from the organization’s operational database.
c) Supports information processing by providing a solid platform of consolidated, historical data for analysis.
d) “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
Data Warehouse—Subject-Oriented
1) Organized around major subjects, such as customer, product, sales.
2) Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
3) Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse—Integrated
1) Constructed by integrating multiple, heterogeneous data sources
2) relational databases, flat files, on-line transaction records
3) Data cleaning and data integration techniques are applied.
4) Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
5) When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
1) The time horizon for the data warehouse is significantly longer than that of operational systems.
2) Operational database: current value data.
3) Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
4) Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
5) The key of operational data, by contrast, may or may not contain a “time element”.
Data Warehouse—Non-Volatile
1) A physically separate store of data transformed from the operational environment.
2) Operational update of data does not occur in the data warehouse environment.
a) Does not require transaction processing, recovery, and concurrency control mechanisms
b) Requires only two operations in data accessing:
initial loading of data and access of data.
Data Warehouse vs. Operational DBMS
1) OLTP (on-line transaction processing)
a) Major task of traditional relational DBMS
b) Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
2) OLAP (on-line analytical processing)
a) Major task of data warehouse system
b) Data analysis and decision making
Why Separate Data Warehouse?
1) High performance for both systems
a) DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
b) Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
2) Different functions and different data:
a) missing data: Decision support requires historical data which operational DBs do not typically maintain
b) data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
c) data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
---------------------------------------------------------------------
2) A multi-dimensional data model
From Tables and Spreadsheets to Data Cubes
1) A data warehouse is based on a multidimensional data model which views data in the form of a data cube
2) A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
a) Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
b) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
Typical OLAP Operations
1) Roll up (drill-up): summarize data
a) by climbing up hierarchy or by dimension reduction
2) Drill down (roll down): reverse of roll-up
a) from higher level summary to lower level summary or detailed data, or introducing new dimensions
3) Slice and dice:
a) project and select
4) Pivot (rotate):
a) reorient the cube for visualization, e.g., from a 3D cube to a series of 2D planes.
5) Other operations
a) drill across: involving (across) more than one fact table
b) drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
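As a hedged illustration, the SQL:1999 grouping extensions express roll-up directly; the table and column names below are hypothetical:

-- Roll-up: produces (item, quarter) subtotals, per-item totals,
-- and the grand total in a single pass.
SELECT item, quarter, SUM(dollars_sold) AS total_sold
FROM   sales_fact
GROUP  BY ROLLUP (item, quarter);

-- Slice: fix one dimension with a selection, then project.
SELECT item, SUM(dollars_sold)
FROM   sales_fact
WHERE  quarter = 'Q1'
GROUP  BY item;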
Star and snowflake schemas?
Star and snowflake schema designs are mechanisms to separate facts and dimensions into separate tables. Snowflake schemas further separate the different levels of a hierarchy into separate tables. In either schema design, each table is related to another table with a primary key/foreign key relationship. Primary key/foreign key relationships are used in relational databases to define many-to-one relationships between tables.
Primary keys
A primary key is a column or a set of columns in a table whose values uniquely identify a row in the table. A relational database is designed to enforce the uniqueness of primary keys by allowing only one row with a given primary key value in a table.
Foreign keys
A foreign key is a column or a set of columns in a table whose values correspond to the values of the primary key in another table. In order to add a row with a given foreign key value, there must exist a row in the related table with the same primary key value.
The primary key/foreign key relationships between tables in a star or snowflake schema, sometimes called many-to-one relationships, represent the paths along which related tables are joined together in the RDBMS. These join paths are the basis for forming queries against historical data.
Fact tables
A fact table is a table in a star or snowflake schema that stores facts that measure the business, such as sales, cost of goods, or profit. Fact tables also contain foreign keys to the dimension tables. These foreign keys relate each row of data in the fact table to its corresponding dimensions and levels.
Dimension Tables
A dimension table is a table in a star or snowflake schema that stores attributes that describe aspects of a dimension. For example, a time table stores the various aspects of time such as year, quarter, month, and day. A foreign key of a fact table references the primary key in a dimension table in a many-to-one relationship.
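A minimal star-schema sketch in SQL DDL (hypothetical names) showing the primary key/foreign key relationship described above:

-- A dimension table and a fact table that references it.
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,   -- primary key: uniquely identifies each row
    day      INTEGER,
    month    INTEGER,
    quarter  INTEGER,
    year     INTEGER
);

CREATE TABLE sales_fact (
    time_key     INTEGER NOT NULL REFERENCES time_dim (time_key),  -- foreign key
    item_key     INTEGER NOT NULL,   -- would reference an item dimension
    dollars_sold DECIMAL(12,2),      -- the measure
    PRIMARY KEY (time_key, item_key)
);

Each fact row joins to exactly one row of each dimension, the many-to-one relationship on which star and snowflake queries are built.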
Star schemas
The following figure shows a star schema with a single fact table and four dimension tables. A star schema can have any number of dimension tables. The crow's feet at the end of the links connecting the tables indicate a many-to-one relationship between the fact table and each dimension table.
Snowflake schemas
The following figure shows a snowflake schema with two dimensions, each having three levels. A snowflake schema can have any number of dimensions and each dimension can have any number of levels.
Review Questions in data mining?
1- What is data mining? In your answer, address the following:
Answer: Data mining refers to the process or method that extracts or "mines" interesting knowledge or patterns from large amounts of data.
2- Is it another hype?
Answer: Data mining is not another hype. Instead, the need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Thus, data mining can be viewed as the result of the natural evolution of information technology.
3- Is it a simple transformation of technology developed from databases, statistics, and machine learning?
Answer:
No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. Instead, data mining involves an integration, rather than a simple transformation, of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.
4- Explain how the evolution of database technology led to data mining.
Answer:
Database technology began with the development of data collection and database creation mechanisms that led to the development of effective mechanisms for data management, including data storage and retrieval, and query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding. Hence, data mining began its development out of this necessity.
5- Describe the steps involved in data mining when viewed as a process of knowledge discovery.
Answer:
The steps involved in data mining when viewed as a process of knowledge discovery are as follows:
• Data cleaning, a process that removes or transforms noise and inconsistent data
• Data integration, where multiple data sources may be combined
• Data selection, where data relevant to the analysis task are retrieved from the database
• Data transformation, where data are transformed or consolidated into forms appropriate for mining
• Data mining, an essential process where intelligent and efficient methods are applied in order to extract patterns
• Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge based on some interestingness measures
• Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user
Practice Test
1. Present an example where data mining is crucial to the success of a business. What data mining functions does this business need? Can they be performed alternatively by data query processing or simple statistical analysis?
2. Suppose your task as a software engineer at Big-University is to design a data mining system to examine their university course database, which contains the following information: the name, address, and status (e.g., undergraduate or graduate) of each student, the courses taken, and their cumulative grade point average (GPA). Describe the architecture you would choose. What is the purpose of each component of this architecture?
3. How is a data warehouse different from a database? How are they similar?
What are the issues in Data Mining?
Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is fast becoming ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many pending issues still have to be addressed. Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way.
Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behaviour understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies is gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.
User interface issues: The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many data exploratory analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective data graphical presentation. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are "screen real-estate", information rendering, and interaction. Interactivity with the data and data mining results is crucial since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.
Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, the control and handling of noise in data, etc. are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users' needs differently.
Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information.
More than the size of the data, the size of the search space is decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space. The search space usually grows exponentially as the number of dimensions increases. This is known as the curse of dimensionality. This "curse" degrades the performance of some data mining approaches so badly that it has become one of the most urgent issues to solve.
What can be discovered?
The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to make predictions based on inference from available data.
The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:
• Characterization: Data characterization is a summarization of general features of objects in a target class, and produces what is called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction. For example, one may want to characterize the OurVideoStore customers who regularly rent more than 30 movies a year. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization. Note that with a data cube containing summarization of data, simple OLAP operations fit the purpose of data characterization.
• Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental account is lower than 5. The techniques used for data discrimination are very similar to the techniques used for data characterization with the exception that data discrimination results include comparative measures.
• Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases, and based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. For example, it could be useful for the OurVideoStore manager to know what movies are often rented together or if there is a relationship between renting a certain type of movie and buying popcorn or pop. The discovered association rules are of the form: P -> Q [s,c], where P and Q are conjunctions of attribute value-pairs, and s (for support) is the probability that P and Q appear together in a transaction and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule:
RentType(X, "game") AND Age(X, "13-19") -> Buys(X, "pop") [s=2% ,c=55%]
would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying a pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop. (These two measures are written out formally after this list.)
• Classification: Classification analysis is the organization of data in given classes. Also known as supervised classification, classification uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects. For example, after starting a credit policy, the OurVideoStore managers could analyze the customers’ behaviours vis-à-vis their credit, and accordingly label the customers who received credits with three possible labels: "safe", "risky" and "very risky". The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
• Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of predictions: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification. Once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or increase/decrease trends in time-related data. The major idea is to use a large number of past values to consider probable future values.
• Clustering: Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering, class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
• Outlier analysis: Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.
• Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of data that changes over time. Evolution analysis models evolutionary trends in data, which allows characterizing, comparing, classifying or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.
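Written out formally in LaTeX notation (following the common convention in which P ∪ Q denotes a transaction containing every item in both P and Q), the support and confidence used in the association analysis above are:

s(P \Rightarrow Q) = P(P \cup Q), \qquad c(P \Rightarrow Q) = P(Q \mid P) = \frac{s(P \cup Q)}{s(P)}

For the video-store rule above, s = 2% and c = 55%, so about 0.02 / 0.55 ≈ 3.6% of all transactions satisfy the left-hand side of the rule.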
It is common that users do not have a clear idea of the kind of patterns they can discover or need to discover from the data at hand. It is therefore important to have a versatile and inclusive data mining system that allows the discovery of different kinds of knowledge and at different levels of abstraction. This also makes interactivity an important attribute of a data mining system.
Is all that is discovered interesting and useful?
Data mining allows the discovery of knowledge potentially useful and unknown. Whether the knowledge discovered is new, useful or interesting is very subjective and depends upon the application and the user. It is certain that data mining can generate, or discover, a very large number of patterns or rules. In some cases the number of rules can reach the millions. One can even think of a meta-mining phase to mine the oversized data mining results. To reduce the number of patterns or rules discovered that are likely to be non-interesting, one has to put a measurement on the patterns. However, this raises the problem of completeness. The user would want to discover all rules or patterns, but only those that are interesting. The measurement of how interesting a discovery is, often called interestingness, can be based on quantifiable objective elements such as validity of the patterns when tested on new data with some degree of certainty, or on some subjective depictions such as understandability of the patterns, novelty of the patterns, or usefulness.
Discovered patterns can also be found interesting if they confirm or validate a hypothesis sought to be confirmed or unexpectedly contradict a common belief. This brings the issue of describing what is interesting to discover, such as meta-rule guided discovery that describes forms of rules before the discovery process, and interestingness refinement languages that interactively query the results for interesting patterns after the discovery phase. Typically, measurements for interestingness are based on thresholds set by the user. These thresholds define the completeness of the patterns discovered.
Identifying and measuring the interestingness of patterns and rules discovered, or to be discovered, is essential for the evaluation of the mined knowledge and the KDD process as a whole. While some concrete measurements exist, assessing the interestingness of discovered knowledge is still an important research issue.
What kind of Data can be mined?
In principle, data mining is not specific to one type of media or data. Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. Indeed, the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. Here are some examples in more detail:
• Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, scientific measurements, etc.
• Relational Databases: Briefly, a relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key. In Figure 1.2 we present some relations Customer, Items, and Borrow representing business activity in a fictitious video store OurVideoStore. These relations are just a subset of what could be a database for the video store and are given as an example.
The most commonly used query language for relational database is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. For instance, an SQL query to select the videos grouped by category would be:
SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category;
Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files, since they can take advantage of the structure inherent to relational databases. While data mining can benefit from SQL for data selection, transformation and consolidation, it goes beyond what SQL could provide, such as predicting, comparing, detecting deviations, etc.
• Data Warehouses: A data warehouse, as a storehouse, is a repository of data collected from multiple data sources (often heterogeneous) that is intended to be used as a whole under the same unified schema. A data warehouse gives the option to analyze data from different sources under the same roof. Let us suppose that OurVideoStore becomes a franchise in North America. Many video stores belonging to the OurVideoStore company may have different databases and different structures. If the executive of the company wants to access the data from all stores for strategic decision-making, future direction, marketing, etc., it would be more appropriate to store all the data in one site with a homogeneous structure that allows interactive analysis. In other words, data from the different stores would be loaded, cleaned, transformed and integrated together. To facilitate decision-making and multi-dimensional views, data warehouses are usually modeled by a multi-dimensional data structure. Figure 1.3 shows an example of a three-dimensional subset of a data cube structure used for the OurVideoStore data warehouse.
The figure shows summarized rentals grouped by film categories, then a cross table of summarized rentals by film categories and time (in quarters). The data cube gives the summarized rentals along three dimensions: category, time, and city. A cube contains cells that store values of some aggregate measures (in this case rental counts), and special cells that store summations along dimensions. Each dimension of the data cube contains a hierarchy of values for one attribute.
Because of their structure, the pre-computed summarized data they contain and the hierarchical attribute values of their dimensions, data cubes are well suited for fast interactive querying and analysis of data at different conceptual levels, known as On-Line Analytical Processing (OLAP). OLAP operations allow the navigation of data at different levels of abstraction, such as drill-down, roll-up, slice, dice, etc. Figure 1.4 illustrates the drill-down (on the time dimension) and roll-up (on the location dimension) operations. (A hedged SQL sketch of such a cube query appears after this list.)
• Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items. Associated with the transaction files could also be descriptive data for the items. For example, in the case of the video store, the rentals table such as shown in Figure 1.5 represents the transaction database. Each record is a rental contract with a customer identifier, a date, and the list of items rented (e.g. video tapes, games, VCR, etc.). Since relational databases do not allow nested tables (i.e. a set as attribute value), transactions are usually stored in flat files or stored in two normalized transaction tables, one for the transactions and one for the transaction items. One typical data mining analysis on such data is the so-called market basket analysis or association rules, in which associations between items occurring together or in sequence are studied. (A sketch of such support counting in SQL appears after this list.)
• Multimedia Databases: Multimedia databases include video, images, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system. Multimedia is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.
• Spatial Databases: Spatial databases are databases that, in addition to usual data, store geographical information like maps, and global or regional positioning. Such spatial databases present new challenges to data mining algorithms.
• Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes causes the need for a challenging real time analysis. Data mining in such databases commonly includes the study of trends and correlations between evolutions of different variables, as well as the prediction of trends and movements of the variables in time. Figure 1.7 shows some examples of time-series data.
• World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily. Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications. Conceptually, the World Wide Web is comprised of three major components: The content of the Web, which encompasses documents available; the structure of the Web, which covers the hyperlinks and the relationships between documents; and the usage of the web, describing how and when the resources are accessed. A fourth dimension can be added relating the dynamic nature or evolution of the documents. Data mining in the World Wide Web, or web mining, tries to address all these issues and is often divided into web content mining, web structure mining and web usage mining.
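Two hedged SQL sketches for the repository types above; all table and column names are hypothetical. First, the OurVideoStore summary cube described under Data Warehouses: the SQL:1999 GROUP BY CUBE clause aggregates over every subset of {category, quarter, city}, i.e. all 2^3 = 8 cuboids, including the grand total:

SELECT category, quarter, city, SUM(rental_count) AS rentals
FROM   rentals_fact
GROUP  BY CUBE (category, quarter, city);

Second, the market basket analysis mentioned under Transaction Databases, as support counting over a normalized transaction-items table:

-- Count how often two items occur in the same transaction;
-- HAVING keeps only pairs meeting a minimum support count.
SELECT a.item_id AS item1,
       b.item_id AS item2,
       COUNT(*)  AS pair_count
FROM   txn_items a
JOIN   txn_items b
       ON  a.txn_id = b.txn_id
       AND a.item_id < b.item_id          -- each unordered pair once
GROUP  BY a.item_id, b.item_id
HAVING COUNT(*) >= 50;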
What are Data Mining and Knowledge Discovery?
With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for analysis and perhaps interpretation of such data and for the extraction of interesting knowledge that could help in decision-making.
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data mining is actually part of the knowledge discovery process. The following figure (Figure 1.1) shows data mining as a step in an iterative knowledge discovery process.
The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
• Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant data are removed from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common source.
• Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.
• Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
• Data mining: it is the crucial step in which clever techniques are applied to extract patterns potentially useful.
• Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge representation: is the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
It is common to combine some of these steps together. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined where the consolidation of the data is the result of the selection, or, as for the case of data warehouses, the selection is done on transformed data.
The KDD process is iterative. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.
Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously probing the material to exactly pinpoint where the values reside. It is, however, a misnomer, since mining for gold in rocks is usually called "gold mining" and not "rock mining", thus by analogy, data mining should have been called "knowledge mining" instead. Nevertheless, data mining became the accepted customary term, and very rapidly a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD) that describe a more complete process. Other similar terms referring to data mining are: data dredging, knowledge extraction and pattern discovery.

Introduction to Data Mining

Chapter I: Introduction to Data Mining
We are in an age often referred to as the information age. In this information age, because we believe that information leads to power and success, and thanks to sophisticated technologies such as computers, satellites, etc., we have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos has led to the creation of structured databases and database management systems (DBMS). The efficient database management systems have been very important assets for management of a large corpus of data and especially for effective and efficient retrieval of particular information from a large collection whenever needed. The proliferation of database management systems has also contributed to recent massive gathering of all sorts of information. Today, we have far more information than we can handle: from business transactions and scientific data, to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for decision-making. Confronted with huge collections of data, we have now created new needs to help us make better managerial choices. These needs are automatic summarization of data, extraction of the "essence" of information stored, and the discovery of patterns in raw data.
1) What kind of information are we collecting?
We have been collecting a myriad of data, from simple numerical measurements and text documents, to more complex information such as spatial data, multimedia channels, and hypertext documents. Here is a non-exclusive list of a variety of information collected in digital form in databases and in flat files.
• A) Business transactions: Every transaction in the business industry is (often) "memorized" for perpetuity. Such transactions are usually time related and can be inter-business deals such as purchases, exchanges, banking, stock, etc., or intra-business operations such as management of in-house wares and assets. Large department stores, for example, thanks to the widespread use of bar codes, store millions of transactions daily, often representing terabytes of data. Storage space is not the major problem, as the price of hard disks is continuously dropping, but the effective use of the data in a reasonable time frame for competitive decision-making is definitely the most important problem to solve for businesses that struggle to survive in a highly competitive world.
• B) Scientific data: Whether in a Swiss nuclear accelerator laboratory counting particles, in the Canadian forest studying readings from a grizzly bear radio collar, on a South Pole iceberg gathering data about oceanic activity, or in an American university investigating human psychology, our society is amassing colossal amounts of scientific data that need to be analyzed. Unfortunately, we can capture and store more new data faster than we can analyze the old data already accumulated.
• C) Medical and personal data: From government census to personnel and customer files, very large collections of information are continuously gathered about individuals and groups. Governments, companies and organizations such as hospitals are stockpiling very important quantities of personal data to help them manage human resources, better understand a market, or simply assist clientele. Regardless of the privacy issues this type of data often reveals, this information is collected, used and even shared. When correlated with other data this information can shed light on customer behaviour and the like.
• D) Surveillance video and pictures: With the amazing collapse of video camera prices, video cameras are becoming ubiquitous. Video tapes from surveillance cameras are usually recycled and thus the content is lost. However, there is a tendency today to store the tapes and even digitize them for future use and analysis.
• E) Satellite sensing: There is a countless number of satellites around the globe: some are geo-stationary above a region, and some are orbiting around the Earth, but all are sending a non-stop stream of data to the surface. NASA, which controls a large number of satellites, receives more data every second than what all NASA researchers and engineers can cope with. Many satellite pictures and data are made public as soon as they are received in the hopes that other researchers can analyze them.
• F) Games: Our society is collecting a tremendous amount of data and statistics about games, players and athletes. From hockey scores, basketball passes and car-racing laps, to swimming times, boxers' punches and chess positions, all the data are stored. Commentators and journalists are using this information for reporting, but trainers and athletes would want to exploit this data to improve performance and better understand opponents.
• G) Digital media: The proliferation of cheap scanners, desktop video cameras and digital cameras is one of the causes of the explosion in digital media repositories. In addition, many radio stations, television channels and film studios are digitizing their audio and video collections to improve the management of their multimedia assets. Associations such as the NHL and the NBA have already started converting their huge game collection into digital forms.
• H) CAD and Software engineering data: There are a multitude of Computer Assisted Design (CAD) systems for architects to design buildings or engineers to conceive system components or circuits. These systems are generating a tremendous amount of data. Moreover, software engineering is a source of considerable similar data with code, function libraries, objects, etc., which need powerful tools for management and maintenance.
• I) Virtual Worlds: There are many applications making use of three-dimensional virtual spaces. These spaces and the objects they contain are described with special languages such as VRML. Ideally, these virtual spaces are described in such a way that they can share objects and places. There is a remarkable amount of virtual reality object and space repositories available. Management of these repositories as well as content-based search and retrieval from these repositories are still research issues, while the size of the collections continues to grow.
• J) Text reports and memos (e-mail messages): Most of the communications within and between companies or research organizations or even private people, are based on reports and memos in textual forms often exchanged by e-mail. These messages are regularly stored in digital form for future use and reference creating formidable digital libraries.
• K) The World Wide Web repositories: Since the inception of the World Wide Web in 1993, documents of all sorts of formats, content and description have been collected and inter-connected with hyperlinks, making it the largest repository of data ever built. Despite its dynamic and unstructured nature, its heterogeneous characteristic, and its frequent redundancy and inconsistency, the World Wide Web is the most important data collection regularly used for reference because of the broad variety of topics covered and the infinite contributions of resources and publishers. Many believe that the World Wide Web will become the compilation of human knowledge.
Thursday, May 22, 2008
TECHNOLOGY
Gamer anger at Nokia's 'lock in'
[Image: Nokia relaunched the N-Gage service last month]
Gamers have hit out at Nokia after learning that N-Gage titles bought for their handsets are locked to that specific device forever.
If a gamer changes or upgrades to a different Nokia handset they have to purchase the games again if they want to continue playing.
The issue was uncovered by website All About N-Gage.
"It's a bad idea for everyone... the N-Gage platform, gamers and third party publishers," the site said.
Nokia said it had made the decision to prevent piracy and to ensure its "partners receive their rightful revenues from our platform".
Hidden catch
Nokia relaunched its N-Gage mobile gaming platform last month.
About 30 games are available on a limited range of Nokia handsets, which are bought and downloaded direct to the phone.
It is the company's second attempt at making mobile gaming a success. In 2003 it released a dedicated handset for gaming, but the device never took off.
Ahead of the latest launch, Jaakko Kaidesoja from Nokia's Play New Experience division, told BBC News: "One of the best things we learned from the original N-Gage is that you can create a community and people appreciate the connectivity."
But the new platform has provoked anger amongst gamers.
Writing on the official N-Gage forums, one gamer said: "Changes need to be made soon, and sticking one's head in the sand will not change anybody's mind."
When gamers sign up for the service they have to agree to terms and conditions, part of which explains that games cannot be transferred between devices.
It states: "Content shall be... limited to one private installation on one N-Gage compatible Nokia device only."
But gamers have complained that the detail is buried in the terms and conditions and it is not clear enough at the point of purchase.
A statement from Nokia said: "Our policy is that the N-Gage activation codes only work on the device where they were first activated.
"As with any digital media there is a potential risk of piracy and this policy is one of the ways we are dealing with piracy and ensuring our partners receive their rightful revenues from our platform.
"If users need to repair their device, the activation codes will be reissued."
2)
Number keys promise safer data
[Image: The system hinges on multiple keys for multiple items]
Mathematicians at the University of California in Los Angeles have applied a fundamental rethink to improve the "one lock - one key" method that current encryption technologies such as RSA and AES operate on.
Amit Sahai, associate professor at UCLA, told BBC World Service's Digital Planet programme that they had decided to "rebuild the idea from the ground up," and developed the idea of multiple keys giving access to selected pieces of data.
"In our vision, we'll have some data that can be locked - but now that one lock is openable by many different keys in many different ways," he explained.
Key management
Currently, when information is encrypted, it is secured with a digital lock and key created together.
While this works well for individual computers, it presents problems on an industrial scale because company data has to be stored on large servers and accessed by large numbers of people.
The UCLA mathematicians point out that this leads to a big problem in terms of key management.
"That key management problem, of needing so many different keys to have access to all the files they should be able to have access to... is so complicated that they just don't use encryption," Dr Sahai said.
"Encryption is essentially not used by most large corporations, and to the extent that it is used, it is used incorrectly or in a silly way."
And in many systems, the key is put on the same server that holds the encrypted data.
[Image: Access to medical records could become much more sophisticated]
He said that a good example of how his system could work is a person's medical records. Whereas currently access to the records is on an all-or-nothing basis, the advanced encryption would allow different amounts of access according to a person's relation to the patient.
Dieticians would be able to see blood sugar levels, while oncologists can see cancer reports.
"Similarly, many different people - depending on who they are and what their position is - should be able to access many different aspects of my medical records," Dr Sahai said.
"What we want to do - and what we've done, to some extent - is to have a mathematical encryption scheme where you encrypt your medical record once, and then different people with different keys can open it in different ways."
Engineering
Doing this with existing technology would mean all different aspects of data would have to be separately encrypted.
Meanwhile, Dr Sahai said that the "clever thing" about his system was that it approached the problem using maths, rather than just as a data problem.
"We're trying to take some of the very difficult job that we give to the security engineer and actually put it into the mathematics itself," he said.
"Once you have this kind of expressibility in the mathematics itself, it makes the job of the security engineer that much easier - because the mathematics is protecting you.3)
Bright future predicted for Apple
- Darren Waters
- 22 May 08, 10:30 GMT
Analysts like to make waves. After all, if what they say lacks impact, then no-one pays attention.
So how about this prediction from Forrester: "Apple Inc. will become the hub of the digital home by 2013."
Forrester says Apple will evolve an "integrated digital experience" based on eight pillars.
Four of them you will probably recognise:
The Mac, Apple TV, the Apple store (the physical shop), iTunes.
Four of them are, ahem, guesswork from Forrester:
Apple home server product, AppleSound universal music controller, network-enabled gadgets (ie music, digital photo frame and alarm clock devices) and in-home installation services.
Now, I can certainly believe that Apple is working on a home server product, that's not really a big prediction. It's merely an extension of Apple TV and the Time Capsule wireless storage device it already ships.
But an AppleSound universal music controller? Do they mean a remote control? I'm not even sure why this is needed.
And can anyone else envisage Apple selling digital photo frames or alarm clocks? Nope, me neither.
And Apple offering in-home installation services? Erm, isn't the whole point of Apple's products that you don't need professional installation help? And what would people be installing exactly?
These predictions strike me as off key for a number of reasons:
1. I don't see Apple displacing satellite and cable firms so radically. In fact, I see more disruption of Apple's business by set-top box providers than the other way around.
2. Apple TV remains a work in progress and hasn't proved its potential.
3. Content providers are now very wary of doing deals with iTunes that leave them at the mercy of Steve Jobs. The music industry is doing everything in its power to break iTunes' hold. The film and TV industry won't make the same mistake.
4. Open standards will triumph. I don't believe that "lock in" systems will ever work as the glue between our devices.
5. I don't think one company will ever be the hub. Interoperability will mean that we can cherry pick our devices and our content will run between them all.
4)
Sceptics question Microsoft move
[Image: Office is the dominant productivity suite of programs]
Open source advocates have questioned Microsoft's commitment to using open document standards in the future.
The computer giant has said it will implement use of the Open Document Format (ODF), "sometime next year".
The Free Software Foundation Europe said: "It's a step in the right direction but we are sceptical about how open Microsoft will be."
The European Commission, which has fined Microsoft for monopolistic practice, welcomed the move.
"The Commission would welcome any step that Microsoft took towards genuine interoperability, more consumer choice and less vendor lock-in," it said.
The Commission added that it would look into whether Microsoft's announcement "leads to better interoperability and allows consumers to process and exchange their documents with the software product of their choice".
Open source software advocates have long criticised the file formats used by Microsoft's Office suite of programs because they are not genuinely interoperable with software from third parties.
Microsoft has said it will add support for ODF when it updates Office 2007 next year.
Georg Greve, president of the Free Software Foundation Europe, said he remained dubious about "how deep" Microsoft's adoption of the standard would go.
'Right direction'
"This is definitely a step in the right direction. We have been encouraging Microsoft to support ODF natively for quite a while.
"Like all things, this will depend to some extent on how they do it."
The Open Document Format Alliance said it was sceptical about the extent of Microsoft's commitment.
Marino Marcich, managing director of the ODF Alliance, said: "The proof will be whether and when Microsoft's promised support for ODF is on par with its support for its own formats.
"Governments will be looking for actual results, not promises in press releases."
At the moment, Office users can open ODF documents by using a downloaded "translator" program.
But critics have pointed out that the translator does not integrate very well with parts of the Office suite.
The move by Microsoft follows attempts by the company to have its own standard, the OpenXML format, recognised as interoperable.
The International Organization for Standardization (ISO) approved its use but the full specification of the OpenXML format has yet to be published.
Mr Greve said: "Support for ODF indicates there are problems with OpenXML that Microsoft cannot resolve easily and quickly.
"OpenXML is something all users want to stay away from. It's not clear if it will ever become an interoperable standard and so users should be very careful using it."
Mr Greve said "genuine adoption" of ODF would give consumers more choice.
'Full choice'
"People will no longer need to use Microsoft Office in order to interoperate.
"They will no longer need to choose a support platform for Office, i.e. Windows."
He added: "There will be full choice on the desktop; people could switch to Linux and choose Open Office or other applications that support ODF, like Lotus Symphony or Google Docs.
"There is fairly large amount of apps to choose from, which can be based on the merits of the application and their personal preference.Posted: 03:16 PM ET
Imagine the anticipation of a countdown before rocket engines roar to life. Smoke billows, and it’s three G’s and eight-and-a-half minutes to space.
After you slip the surly bonds, you float over to the window and gaze wide-eyed at the majesty of Planet Earth. Perhaps you’d spot the Great Wall of China, or even a big hurricane. I’d have Bowie’s “Ground Control to Major Tom” playing on my iPod.
Spaceflight tickles the imagination. It’s the stuff of heroes and explorers. We remain in awe of the cosmos, and amazed at each incremental step toward the infinite.
Source: NASA
Now take a look at this photo. The folks at Johnson Space Center in Houston sent this picture to me today. Not exactly what you imagine while reading Jules Verne or Arthur Clarke. It might be the NASA equivalent of witnessing hot dogs in the making.
You’re looking at a test chamber scaled to be the size of the Orion crew capsule. Orion, of course, is NASA’s next-gen exploration vehicle. It will carry crew and cargo to the space station and on to the moon.
The umbrella name for the entire program is Constellation, and the space agency is hoping to launch the first manned mission by 2015.
The chamber is the size of a walk-in closet - about 570 cubic feet - and the people sitting inside are volunteers recruited to test a lunar breathing system called CAMRAS. (NASA likes its acronyms!) It stands for Carbon-dioxide and Moisture Removal Amine Swing-bed. Go figure.
But imagine sitting for eight hours in this thing with five other people you've just met. Twenty-three volunteers did just that for a series of tests over a three-week period last month. The point: to breathe and sweat. Sounds like the perfect job for an executive producer!
Seriously though, NASA has to measure the amount of moisture and carbon dioxide absorbed by the system so Orion crews can breathe easily and live comfortably in space. Volunteers were asked to sleep, eat and exercise in the chamber. Some test sessions lasted a few hours and others were overnight.
CAMRAS uses very little energy. An organic compound called an amine absorbs the CO2 and water vapor from the cabin. And when the system vents the waste overboard, the vacuum of space regenerates the amine. Think of the venting as wringing out a dirty sponge.
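To make the "swing" in Swing-bed concrete, here is a toy simulation of the cycle NASA describes: one amine bed scrubs CO2 and moisture from cabin air while the other is vented to vacuum, and the two periodically swap roles. Every constant and name below is invented for illustration; none of it is a NASA figure.

```python
# Toy model of a two-bed amine swing scrubber, loosely following the
# description above. All constants are assumptions made up for this
# sketch, NOT NASA data.

CYCLE_MINUTES = 20    # assumed minutes before the beds swap roles
ABSORB_RATE = 4.0     # assumed units of CO2/moisture absorbed per minute
BED_CAPACITY = 100.0  # assumed maximum load one bed can hold

def simulate(total_minutes: int) -> list[float]:
    """Return the load on each bed after running for total_minutes."""
    loads = [0.0, 0.0]  # current CO2/moisture load of bed 0 and bed 1
    active = 0          # index of the bed currently scrubbing cabin air
    for minute in range(total_minutes):
        if minute > 0 and minute % CYCLE_MINUTES == 0:
            # Swap: the loaded bed is vented overboard, and the vacuum
            # strips ("wrings out") what it absorbed. Regeneration is
            # idealised here as instantaneous.
            loads[active] = 0.0
            active = 1 - active
        # The active bed keeps absorbing until it saturates.
        loads[active] = min(BED_CAPACITY, loads[active] + ABSORB_RATE)
    return loads

print(simulate(90))  # bed loads after 90 minutes of simulated scrubbing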
For more on the test and NASA’s Constellation Program, visit www.nasa.gov/constellation.
NASA delays Hubble mission to fix shuttle fuel tanks
CAPE CANAVERAL, Florida (AP) -- NASA's final visit to the Hubble Space Telescope has been delayed at least a month, until the fall, because of the extra time needed to build the shuttle fuel tanks required for the flight and a potential rescue mission.
The Hubble Space Telescope orbits 350 miles above Earth.
Atlantis and a crew of seven were supposed to fly to Hubble at the end of August but now won't make the journey until the end of September or early October.
Shuttle program manager John Shannon said it's taken more time to incorporate all the post-Columbia design changes to the external fuel tanks than had been expected.
"It's a small price to pay, to tell you the truth, four to five weeks for all the improvements that we're getting on this tank," Shannon said Thursday.
The fuel tank for the next shuttle launch is the first to be built from scratch with the design changes. That work delayed Discovery's flight to the international space station from April until May 31.
The mission to Hubble, orbiting 350 miles above Earth, is unique. Not only must Atlantis be ready, another shuttle must be on the launch pad ready to rush to the rescue in case Atlantis suffers severe launch damage that might prevent a safe re-entry.
Unlike other shuttle crews, which travel to the space station, the astronauts on the Hubble mission would have nowhere to seek shelter in the event of a gaping hole in their ship's thermal shield. In the case of a rescue, the Hubble astronauts would put on spacesuits and float out of their ship and into the other shuttle.
Columbia was destroyed and its seven astronauts killed during re-entry in 2003 because of a plate-size hole in the shuttle's left wing. A chunk of fuel-tank foam insulation broke off during liftoff and gashed it.
Because of the delay in the Hubble mission, NASA will have to settle for five shuttle flights this year instead of six. Despite the setback, NASA still hopes to complete the space station and retire its shuttles in 2010, Shannon said.
As for Russia's trouble-plagued Soyuz re-entry April 19, NASA's space station program manager, Mike Suffredini, said Thursday that the investigation into the mishap will determine whether the three astronauts were at any more risk than normal.
The Soyuz spacecraft descended much more steeply than usual and subjected the crew to considerably more gravity forces. It was the second time in a row that the capsule malfunctioned like this.
The crew included U.S. astronaut Peggy Whitson, who was ending a six-month space station stay, as well as a Russian and a South Korean who ended up in the hospital with back and neck pain.
Russia hopes to complete its investigation by the end of May. With U.S. astronaut Gregory Chamitoff scheduled to fly to the space station aboard Discovery and remain there for several months, NASA will have to decide before May 31 whether the Soyuz will serve as a safe lifeboat if there is an emergency.
Suffredini said it would be "pretty dramatic" for NASA to pull Chamitoff or anyone else off the space station. "But we will do whatever is necessary based on the findings of the commission," he said.
As countries and companies plan to go to the moon, a debate heats up on lunar property rights
LONDON, England -- One of Francis Williams' favorite stories to tell is about the time he was pulled over for speeding.
Williams, who had been in London on business, was driving home through the English countryside when a police officer stopped him and wanted to know two things: Was Williams aware of how fast he was driving? And, what was his profession?
It turned out the response to the second question would help Williams resolve the first: "I said, 'I sell land on the moon,'" said Williams. "And [the police officer] said, 'Do you know, my wife has bought some of that.'"
The answer to the first question was subsequently forgotten.
Williams, who describes himself as the "Lunar Ambassador to the United Kingdom," is the owner of MoonEstates. He claims to have sold around 300,000 acres of moon land since he and his wife, Sue, founded the Cornwall-based company eight years ago. One-acre plots of lunar turf go for about $40.
As proof of purchase, new property owners receive a silver tin containing a personalized "Lunar Deed" and a moon map with a tiny black X marking their tract's approximate location. Most of the land Williams sells is in the northwest, in an area known as Oceanus Procellarum, or Ocean of Storms -- a desolate lava plain formed by volcanoes billions of years ago. "I know the Japanese are [selling] further east," he said.
Williams received his license to sell lunar land in the UK from Dennis Hope. In 1980, the Nevada-based entrepreneur claimed ownership of the moon after finding what he calls a loophole in the 1967 United Nations Outer Space Treaty, which forbids countries from owning the moon but, according to Hope, does not forbid individuals from owning it.
Hope, who estimates he has sold over 500 million acres of moon land, said he immediately filed a "declaration of ownership" with the U.N. along with the United States and Russian governments.
After 28 years, the moon mogul still has not received a reply. "I have never heard from them on that note ever," Hope told CNN in a phone interview.
While the U.N. may have ignored Hope's lunar land claims for almost three decades, it is unlikely the organization will be able to ignore what could soon become a question of increasing international importance: Who, exactly, does own the moon?
"At some point the world community needs to come together and draft some new convention or treaty," said Paul Dempsey, director of the Institute of Air and Space Law and McGill University in Montreal. "It is an open wound that needs to be healed."
Dempsey pointed out that at the time the U.N. drafted the Outer Space Treaty, there were only two spacefaring nations -- the U.S. and the Soviet Union. Now there are over a dozen. And many of them, including China, Russia, the U.S., India and Japan, want to go to the moon.
NASA, for example, recently announced plans to return by 2020, eventually building a permanent base on the lunar surface. The Russian space agency, Roskosmos, has confirmed similar intentions.
The burgeoning commercial space sector is also casting its gaze towards Earth's only natural satellite with companies considering everything from mining the lunar surface to building extraterrestrial resorts on it.
"It is quite a complicated issue because it is international law we are dealing with," said Niklas Hedman, chief of the Committee Services and Research Section of the U.N.'s Office for Outer Space Affairs in Vienna.
There are five treaties that govern international affairs in space, said Hedman. Two of them -- the Outer Space Treaty and the 1979 Moon Agreement -- deal with lunar law.
The Outer Space Treaty provides a legal framework for the international use of space for peaceful purposes, including the moon and other celestial bodies. Widely considered the "Magna Carta of space law," this treaty lays down the fundamental principle of non-appropriation and that the exploration and use of space shall be the province of all mankind.
According to the treaty, states bear international responsibility for national activities in space, including by non-governmental entities. The Outer Space Treaty says governments cannot claim ownership of the lunar surface and that stations and installations on the moon shall be open to others, said Hedman.
The Moon Agreement builds upon the Outer Space Treaty but also says that any natural resources found on the moon are part of "the common heritage of mankind" - in other words, they must be shared.
While 98 nations, including all the major spacefaring ones, have ratified the Outer Space Treaty, only 13 countries have approved the Moon Agreement -- Kazakhstan, Lebanon, Uruguay and Mexico, to name four.
But Hedman said this does not mean the other 179 countries that have not ratified the Moon Agreement are free to make a lunar land rush.
"They are still bound by the fundamental provisions [of the Outer Space Treaty]," he said, adding that "when enough states of the world have ratified a treaty, and it becomes binding, then certain fundamental provisions become binding even on states that have not ratified it."
Henry Hertzfeld, a space analyst at George Washington University's Space Policy Institute, said he is not so sure the U.N.'s treaties provide an adequate answer to the question of lunar property rights.
"These treaties don't really have any teeth to them in terms of enforcement," said Hertzfeld. "They are agreements on principle."
Instead of focusing on who owns the moon, the international community needs to find ways to incentivize future business activity on the moon by guaranteeing that rights to land and resources will not be preempted by competing interests, said Hertzfeld.
"Owning property is not the issue, the issue is finding a mechanism for businesses to make a fair return on their investment," he said. "Otherwise there is no point in investing."
But first, Hertzfeld said, there needs to be a guarantee that there is something on the moon worth investing in at all.
"My feeling is until we know what is there, we shouldn't mess with it," he said.