Abstract In recent years, Big Data technology has become affordable, both financially and in the effort required to understand and integrate it. At the same time, the trend towards making data public provides scientists with an abundant source of material for new research. Although Big Data technology is relatively easy to set up for people with an IT background, the hurdle for scientists from other domains is often still too high. Moreover, the majority of scientists may not have a background in Information and Communication Technology, but they do have a good knowledge of statistics. This project moves in that direction by providing an infrastructure where scientists can run experiments on geographical datasets. The aim is to develop an open platform that helps researchers study nuclear radiation and its relation to the geographical conditions and weather of a particular area. The BigDataEurope (BDE) platform will be used to produce the research results, drawing on the nuclear radiation data shared as open data by the volunteer community of SafeCast.org. In addition, open weather and geographical data will be obtained from OpenWeatherMap.org, an IT company that has built and commercialized a relevant platform. This project investigates Big Data technology with a view to increasing the usability of the SafeCast community's data. This document describes the design of the master thesis project to be performed, internally, at Vrije Universiteit Amsterdam (VU). The thesis design is structured as follows: Section 1 defines the research problem and poses the research questions needed to tackle it. Section 2 provides the theoretical background and the state-of-the-art literature relevant to the research problem. Section 3 describes the methodology applied, including the phases in which the goal will be realized. Section 4 provides an illustration of the necessary time frame.
Abbreviations
KNIME - Konstanz Information Miner
PAINS - Pan-Assay Interference Compounds
SMILES - Simplified Molecular Input Line Entry System
SDF - Structure-Data File
CSV - Comma Separated Values
SGC - Structural Genomics Consortium

Research problem description and research questions
In this section, the problem that triggered this research is presented and analyzed extensively, so that the exact aim of the research becomes clear. Nowadays, nuclear radiation and its environmental impact are the subject of widespread speculation. In addition, its effect on the health of living beings is another critical issue and the subject of profound controversy. The issue has concerned the environmental, scientific and social communities, particularly since the last devastating incident in Fukushima, Japan. It was at that point that the need for a thorough examination of the issue became more imperative than ever. Until then, finding radiation data collected without bias, based on an impartial study from an independent viewpoint, was unfortunately difficult. The need, therefore, was to set up a monitoring and mapping system for emergency radiation measurement, demonstrating to the scientific community and to the public that there is an independent organization devoted solely to providing the most accurate and credible data possible. From the outset, SafeCast.org was founded on these particular standards. The SafeCast.org community is a collaborative, global, volunteer-centered citizen-science project working to empower people with data about their environments. It is an effort by volunteers to measure nuclear radiation and to share the measurements as open data, putting data and data-collection know-how in the hands of people worldwide. Begun in response to the nuclear disaster in Japan in March 2011, SafeCast.org collects radiation and other environmental data from all over the world.
Data collection is crowdsourced by volunteers who are widely distributed geographically. As the radiation measurement initiative continues to grow globally through the SafeCast community, and a large amount of data becomes available as input for research, the next step is to set new, equally ambitious goals: to broaden the current scientific boundaries and to redefine current activities in a series of new, interrelated initiatives. Furthermore, with a large amount of radiation data and measurements available, different kinds of conclusions can be drawn. What we do not know at present is how nuclear radiation increases or decreases, and what the ways and means to limit it are. Based on the assumption that weather conditions affect the above, we are called upon to investigate this and draw conclusions. Beyond that, the project could serve as a model for a wider application of the principles of citizen science to environmental monitoring, including air, water and climate. The research questions are:
1) Why is this research relevant?
2) Why can Apache Spark, Docker, weather data and radiation data serve as a motivating background for Big Data technology?
3) How will the Apache Spark data processing engine and its algorithms support the correlation between temperature, wind and radiation?

Theoretical Background and State of the Art Literature
POPULAR TOOLS USED FOR WORKING WITH MESSY DATA, CLEANING IT OR TRANSFORMING IT FROM ONE FORMAT TO ANOTHER
Workflow systems have been used for years in scientific studies that combine computer science with other scientific fields, for instance chemistry, biology, genetics, materials modeling and simulation, drug discovery and physics, and some have even been applied to business informatics and analytics. A workflow system is composed of different elements that pass data to one another. In addition, it is possible to add items such as titles and comments. Different workflow systems exist, depending on the needs.
In our case, the workflow tools most commonly used are Pipeline Pilot [1], Taverna and KNIME [2]. Readers can find details of Discovery Net, Galaxy, Kepler, Triana, SOMA, SMILA, VisTrails, and others on the Web. Kappler has compared Competitive Workflow, Taverna and Pipeline Pilot [3]. In Pipeline Pilot, users can graphically compose protocols, using hundreds of different configurable "components" for operations such as data retrieval, manipulation, computational filtering, and display [4]. Initially, Pipeline Pilot was mainly used in chemistry. It is now being used in a variety of scientific areas, such as NGS and imaging, because Pipeline Pilot can scale to large development projects like these. Using Pipeline Pilot does not require programming expertise to introduce new types into the internal code. Users must have a high level of confidence in the security of their corporate collections, and Pipeline Pilot is designed to address security issues. However, Pipeline Pilot is an expensive tool with a specialized focus that does not easily adapt to wider challenges. KNIME also started as a tool mainly used in chemistry and, with the passage of time, began to be used in other industries, making it a tool used by a wide range of contributors, from software vendors and academia to banks, telecommunication organizations, pharmaceutical institutes and even customer relationship managers. Moreover, KNIME has a graphical user interface for combining "nodes" [5]; collections of nodes are known as "extensions". KNIME is based on the open-source Eclipse platform [6] and on Java. Java is part of KNIME and serves two types of users, developers and non-developers: on the one hand, it is possible to create new components with the Java components API or to build new clients with the Java SDK; on the other hand, the tool serves both developers and non-IT scientists.
This is because KNIME supports programming languages but also provides drag-and-drop activities instead of coding. KNIME relies on the philosophy that data must be an open product and that science is part of this process. However, the tool is not well suited for data visualization. KNIME is not sold; it is free. Its business model [7] includes licensing business information that allows users to share workflows and create web portals. Taverna is the common name for a scientific workflow system comprising the Taverna Workbench graphical workflow authoring client, the SCUFL [8] workflow representation language, and the Freefluo [9] enactment engine. It was developed by the University of Manchester, mainly for scientific use. The main goal of Taverna is to organize services into a useful data collection, bringing together tools and databases available on the Internet to serve the needs of bioinformatics. A further aim is to create scientific workflows from many remote web services. The only control structures the tool contains are coordination links and conditional constructs. Taverna offers some notable features, such as support for large data sets, traceability and support for embedded workflows. Workflows can be run on local machines or on a distributed computing infrastructure, for example in the cloud via the Taverna server. An installation of the server provides access to a workflow collection. However, in this execution mode, users can neither edit their published workflows on the server nor add new workflows to the set of workflows deployed on the server [10]. Having given condensed descriptions of the features of the most prominent workflow systems, it is also important to outline the limitations and gaps that motivate us to work on a different basis.
Although the aforementioned systems are easy to use and cover a wide range of functions, they are not suitable for particularly large and complex processes, because their capacity for large data sets is limited [11].

Technical Background and Tools
All data are available through the SafeCast.org API page, and a few search/filter parameters are provided: filtering by time, by location, by the device used to capture a particular measurement, and by the ID of the user who uploaded it. It is also possible to download the whole data set at once as a .csv file. Data is collected primarily via the SafeCast.org sensor network. The Geiger counter recommended by SafeCast.org is the bGeigie Nano kit, but many other commercially available Geiger counters are used for data collection as well (e.g. Quarta-Rad, DIY Geiger, Ekotest). (SafeCast API page, Figure I)
OpenWeatherMap (www.openweathermap.org) is an IT company with practical experience in Big Data and geospatial technologies. Its mission is to provide a global geospatial platform that is affordable to users and enables them to work effortlessly with Earth Observation data such as satellite imagery, weather data and similar sources. With OpenWeatherMap's platform, new data-driven products can easily be built for this research project. OpenWeatherMap offers weather history and forecasts, open-source satellite imagery, and other Earth Observation data through unified APIs provided by a high-performance platform. OpenWeatherMap provides:
* Current conditions and forecasts for 200,000+ cities and any geo location
* Historical data
* A simple and clear API
* Interactive weather and satellite maps
* Raw data from 40,000+ weather stations
With respect to increasing the usability of the SafeCast.org data, this research project will use the BigDataEurope (BDE) platform for empowering European researchers and companies with data technologies.
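To illustrate how the two data sources described above could be combined, the sketch below builds query URLs for the SafeCast measurements endpoint and the OpenWeatherMap current-weather endpoint. This is a minimal sketch, not a definitive client: the SafeCast parameter names `distance` and `captured_after` are assumptions to be verified against the SafeCast API page, and `YOUR_API_KEY` is a placeholder for a real OpenWeatherMap key.

```python
from urllib.parse import urlencode

def safecast_measurements_url(lat, lon, distance_m, captured_after):
    """Build a SafeCast API query URL for radiation measurements.

    Parameter names (distance, captured_after) are assumptions;
    verify them against the SafeCast API page before use.
    """
    params = {
        "latitude": lat,
        "longitude": lon,
        "distance": distance_m,           # search radius (assumed parameter)
        "captured_after": captured_after, # time filter (assumed parameter)
    }
    return "https://api.safecast.org/measurements.json?" + urlencode(params)

def openweathermap_url(lat, lon, api_key):
    """Build an OpenWeatherMap current-weather query URL."""
    params = {"lat": lat, "lon": lon, "appid": api_key}
    return "https://api.openweathermap.org/data/2.5/weather?" + urlencode(params)

# Example: radiation and weather data around Fukushima
sc_url = safecast_measurements_url(37.75, 140.47, 5000, "2017-01-01T00:00:00Z")
ow_url = openweathermap_url(37.75, 140.47, "YOUR_API_KEY")
print(sc_url)
print(ow_url)
```

The URLs can then be fetched with `urllib.request.urlopen` (or any HTTP client) and the JSON responses joined on location and time before analysis.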
The BigDataEurope (BDE) platform is a coordination and support action funded by the Horizon 2020 program of the European Commission, in which VU was a consortium member. The management and analysis of large-scale datasets, described with the term Big Data, involves the three classic dimensions: volume, velocity and variety. While the former two are well supported by a plethora of software components, the variety dimension is still rather neglected. The BDE platform makes Big Data simpler, cheaper and more flexible than ever before, offering basic building blocks to get started with common Big Data technologies and making integration with other technologies and applications easy. Available blocks include Apache Spark, Hadoop HDFS, Apache Flink and many others. Research efforts are conducted with Smart Big Data, by adding semantics to a data lake and performing structured machine learning on semantically structured data. The BigData Integrator (BDI) is an open-source platform based on Docker, today's virtualization technique of choice. The base Docker platform is enriched with a layer of services that supports the setup, creation and maintenance of workflows. BDI can work on a local development machine or scale up to hundreds of nodes connected in a swarm. The platform can be run in-house, or hosted by vendors such as Amazon Web Services.
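The correlation analysis envisaged in research question 3 boils down to computing a correlation statistic between weather variables and radiation dose rates; on the BDE platform, Spark's `pyspark.ml.stat.Correlation` would compute this in a distributed fashion over the full dataset. The sketch below shows the underlying statistic, the Pearson correlation coefficient, in plain Python on a small invented sample, purely to illustrate what the Spark job would compute at scale; the numbers are not real measurements.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented toy data for one location: daily mean temperature (deg C),
# wind speed (m/s) and radiation dose rate (uSv/h).
temperature = [4.0, 6.5, 9.0, 12.0, 15.5]
wind        = [5.0, 4.2, 3.8, 3.1, 2.5]
radiation   = [0.12, 0.11, 0.10, 0.09, 0.08]

print(round(pearson(temperature, radiation), 3))  # strong negative in this toy sample
print(round(pearson(wind, radiation), 3))         # strong positive in this toy sample
```

In the actual experiments, the same coefficient would be obtained by assembling the measurement columns into a feature vector and calling `Correlation.corr` on a Spark DataFrame, so that the computation scales over the full SafeCast and OpenWeatherMap datasets.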