LSST Science Advisory Committee meeting, September 25, 2017

Attending: Michael Strauss, Mario Juric, Anze Slosar, Beth Willman, Gordon Richards (acting as proxy for Niel Brandt), Zeljko Ivezic, David Kirkby, Jason Kalirai, Josh Simon, Timo Anguita, Charles Liu, Rachel Bean

This one-hour phone conference focused on the LSST plans for the Science Platform, the set of connected interfaces by which LSST science users will interact with the data. Mario Juric gave a presentation, juric_science_platform.pdf, in which he described these plans in some detail. They are also described in a design document at http://ls.st/lse-319 .

The principal questions and recommendations that came out of this discussion are as follows:

- We are eager to see a plan describing how Level 3 resources will be allocated in practice. The model whereby users can draw on substantial LSST resources to carry out their scientific analyses is an attractive one, but it will lead to frustration if people often find themselves limited by the resources available.

- We would also like to see a description of how Level 3 codes and results can be incorporated into Level 2. Making a code robust enough to be run in Level 2 will take real resources from the Project; how will the decision to do so be made?

- There will be a real need to match LSST data with external datasets at every waveband. It would be good to have a coherent plan for which datasets will be directly accessible through the LSST Science Platform, and for the extent to which it will be possible, through the platform, to access external databases via VO protocols.

- The LSST limits access to those with data rights (scientists in the US and Chile, and named external contributors from other countries). How will this work, and how will it be enforced, if the LSST data are accessible from other databases via VO protocols?

- We are concerned about the long-term future of the JupyterLab platform, and would like reassurance that LSST would be able to adapt if JupyterLab were to disappear (or if the needs of the LSST science community were to change).

- We are also concerned about the long-term persistence of user results on the Science Platform, given that the current plan has data releases no longer being available after several years.

- These tools are designed for professional scientists, and are distinct from those meant for the general public in the EPO effort. We do want to make sure that they will be accessible to scientists who are not at R1 universities; we want to avoid barriers for those at smaller colleges, for example.

The interface described here is relevant for the Level 2 data releases. These will happen roughly once per year (twice in the first year of operations), and will include catalogs of measured properties of detected objects as well as calibrated images. One will also be able to access the data from the alerts (Level 1) after they are loaded into the database at the end of each night. Access to the alerts as they happen (they will be issued within 60 seconds of the end of each visit) will be through the web API, or the event brokers, designed to handle them in real time.

There are three approaches to accessing the data; think of them as three views of the same underlying dataset, designed to allow the user to move seamlessly between them:

(1) A data portal, similar to those familiar to users of IRSA and MAST, in which SQL queries can be entered (a sketch of such a query appears after this list), LSST images can be viewed, and simple plots can be made. This is designed to be an exploratory tool, for first looks at the data. A prototype, "Firefly", is already in operation, with SDSS Stripe 82 and NEOWISE data loaded (HSC data coming soon).

(2) An interactive JupyterLab environment (what Jupyter notebooks are in the process of evolving into), in which more sophisticated queries and analyses of the data can be run (in Python or other languages), using LSST computational resources. Most publishable scientific analyses will happen at this level. The user will have access to the software stack of the LSST image processing pipelines.

(3) A series of web APIs whereby larger-scale and more CPU-intensive analyses can be run on external supercomputer systems (including those at NCSA and other resources to which one can apply via the NSF). Note that there are no plans at the moment to have external supercomputer time explicitly allocated to LSST applications.
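To make the first two access modes concrete, the sketch below shows the kind of catalog query a user might issue, whether typed into the portal or sent from a notebook or script via the standard VO TAP protocol (the same protocol raised above in connection with external databases). This is only an illustration: the service URL, table name, and column names are placeholders, not the actual LSST schema.

    import pyvo

    # Placeholder TAP endpoint; the real LSST service URL is not yet defined
    service = pyvo.dal.TAPService("https://data.lsst.example/tap")

    # Hypothetical cone search on a Level 2 Object table:
    # bright objects within 3 arcmin of (ra, dec) = (150.1, 2.2)
    adql = """
        SELECT objectId, ra, decl, psfMag_r
        FROM dr1.Object
        WHERE CONTAINS(POINT('ICRS', ra, decl),
                       CIRCLE('ICRS', 150.1, 2.2, 0.05)) = 1
          AND psfMag_r < 22.0
    """

    results = service.run_sync(adql)
    table = results.to_table()   # an astropy Table, ready for plotting
    print(len(table), "objects returned")

The same pattern, pointed at an external VO-compliant archive, is one way the cross-matching with other-waveband datasets discussed above could be carried out.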
The system is designed to allow users to work in groups and to share code and results within them. It remains a policy question, yet to be decided, whether people will be able to make their code world-readable.

The LSST allots 10% of its compute resources during operations to the facilities running these tools, to allow users to carry out science and create auxiliary data products of their own. This is of order 18 teraflops, with 4 petabytes of storage. The Science Platform is scaled for ~7500 users, of whom roughly 100 would be active at any given time. The system has the ability to temporarily reassign resources from the data-processing cluster (which is ten times larger than the community resources) at times of peak community demand.

Much of our discussion centered on how this 10% would be allocated in practice. The details of this have not been decided. Each user will be given a disk and CPU quota by default, with a mechanism to request additional allocations when needed. There will also be tools allowing the user to estimate the resources (disk space and CPU) that a query or processing job will require before running it; we want to make sure it is difficult to make mistakes! A question: will (unpersisted) scratch space, separate from and presumably larger than the individual quotas, be available as well for people to experiment? The system will have the ability to link directly with cloud drives (e.g., Dropbox, Google Drive).

The "analysis cores" on which all this will run will be exposed as a batch system to which jobs can be submitted using Pegasus and HTCondor. There are efforts under way to understand whether jobs can be run within Docker containers, to ease deployment of user applications, but no promises can be made at this point. AWS/OpenStack-type functionality (the ability of users to request one or more VMs with a particular configuration) is not being planned for.
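As an illustration of what submission to that batch system might look like from a user's session, here is a minimal sketch using the HTCondor Python bindings; the script name, arguments, and resource requests are placeholders, and the actual submission interface the platform exposes may well differ.

    import htcondor

    # Describe the job; the executable and resource requests are hypothetical
    sub = htcondor.Submit({
        "executable": "run_forced_photometry.sh",   # a user-supplied analysis script
        "arguments": "--patch 42",
        "request_cpus": "4",
        "request_memory": "8GB",
        "output": "photometry.out",
        "error": "photometry.err",
        "log": "photometry.log",
    })

    # Hand the job to the platform's scheduler
    schedd = htcondor.Schedd()
    with schedd.transaction() as txn:
        cluster_id = sub.queue(txn)

    print("submitted as cluster", cluster_id)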
Note that the Firefly portal is separate from the tools that will be used in outreach to the general public, since the LSST Project does not have the resources to support the very large number of users that public access would entail. There will of course be sharing of software tools between those working on the portal and the EPO team, where appropriate. We are concerned that the boundary between academics and the public may not always be clearly defined, and we want to make sure that we do not put up barriers for scientists at smaller colleges and others not at R1 universities.

LSST users would have access to a quota of disk space in which their queries and results would persist for as long as they are needed. One question that comes to mind: the current plan has old data releases remaining available for only a few years, as new releases make them obsolete; does that mean that an individual user's results will disappear when the corresponding data release disappears?

The JupyterLab approach is similar to that developed by the database team at JHU (SciServer); similar approaches are being used by the DES and HSC teams, among others. JupyterLab is what Jupyter notebooks are evolving into, and this technology should be stable at least through the commissioning of LSST. We are somewhat concerned, however, about what will happen on longer timescales: what if the organization that supports JupyterLab folds, or goes in a quite different direction, five years from now?
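For concreteness, work in the notebook environment would look something like the minimal sketch below, which assumes the LSST stack's current Butler data-access interface; the repository path and data identifiers are placeholders, not actual LSST data products.

    from lsst.daf.persistence import Butler

    # Placeholder repository path; the Science Platform would supply the real one
    butler = Butler("/datasets/example_repo/rerun/coadd")

    # Hypothetical data identifier for a single visit/CCD
    data_id = {"visit": 1228, "ccd": 49}
    calexp = butler.get("calexp", dataId=data_id)    # calibrated exposure (image, mask, variance)
    sources = butler.get("src", dataId=data_id)      # source catalog measured on that exposure

    print(calexp.getDimensions())
    print(len(sources), "sources detected")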