Proteomics Visualization/integration extended abstract

Integrating Protein Network Visualization Software into a Proteomics Lab Infrastructure

by David Ginsburg

Introduction

In the following document I present the results from my work which integrated a tool that visualizes protein pathways and interactions into Arthur Salomon's proteomics lab, and evaluated the benefits of the integration by getting direct feedback from researchers. Radu Jianu et al. developed a protein pathway exploration tool iteratively with researchers, which visualizes experimental data from phosphorylation experiments in context with known protein interaction networks[1]. Until the present work this tool had not been fully integrated into any lab. The integration allows for the straightforward flow of experimental data into the tool, and establishes connections with proteomics researchers who might use the tool.

The current work expands on the tool's initial development by integrating the tool into the infrastructure of a proteomics lab and obtaining additional feedback, which can be used to assess the tool's current functionality. The protein pathway exploration tool visually depicts the interactions between proteins within a pathway of interest and related proteins reported in the literature, as recorded in the Human Protein Reference Database (HPRD)[2]. The tool was developed in an iterative fashion so that feedback from researchers could steer the direction of development[3]. This has resulted in a tool that better meets the needs of proteomicists in the field.

The process of completing the current work was not without difficulties, and this has provided me with a perspective regarding how the Scientific Visualization group develops software, and what its current infrastructure is lacking. This has led me to detail reflections about my experience to call the group's attention to what I believe to be fundamental mechanisms for software development.

Motivation

Prior to the integration, the protein pathway exploration tool was of little practical use to proteomics researchers, because it did not visualize their current experimental data: there was no straightforward way to make the most recent data available in the tool. Specifically, getting experimental data into the tool involved emailing the developer, and the turnaround time was relatively long. As a result, the tool remained unused by researchers, and further development proceeded without active feedback from users applying the tool to their research. The integration addresses these issues.

Methods

There are two parts to the integration of the protein pathway exploration tool into the Salomon Proteomics Lab. The first part is getting experimental data into the tool in a seamless way. The second part is engaging with researchers directly to teach them how to use the tool and to get their initial feedback.

To allow for easy incorporation of new experimental data into the tool, I created a command-line interface for the program. It takes arguments specifying a protein pathway into which experimental data is to be incorporated, an output file to be created, any number of experimental data sets in XML format, and a path to a configuration file that allows some flexibility. It then generates a protein pathway file containing the information necessary to visualize the specified pathway with the incorporated experimental data. Incorporating experimental data can thus be accomplished with a single command.
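
The shape of such an interface can be sketched as follows. This is only an illustration in Python, not the tool's actual code: the flag names (--pathway, --output, --config) and the file names are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the four kinds of arguments described above."""
    parser = argparse.ArgumentParser(
        description="Merge experimental data sets into a protein pathway file."
    )
    # Flag names are illustrative assumptions, not the tool's real options.
    parser.add_argument("--pathway", required=True, help="pathway to annotate")
    parser.add_argument("--output", required=True, help="pathway file to create")
    parser.add_argument("--config", required=True, help="path to a configuration file")
    # "Any number" of experimental data sets, each given as an XML file.
    parser.add_argument("experiments", nargs="*", help="experiment XML files")
    return parser


# Example invocation with made-up file names.
args = build_parser().parse_args(
    ["--pathway", "pathway.xml", "--output", "merged.xml",
     "--config", "tool.cfg", "run1.xml", "run2.xml"]
)
print(args.pathway, args.output, args.config, args.experiments)
```

Taking the experiment files as trailing positional arguments is one simple way to accept any number of data sets in a single command.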

The main FileMaker interface in the proteomics lab now has a button that generates an XML file containing the latest experimental data set and a batch file to execute this command specifying the appropriate pathway and experiment files. After this button is clicked there is no further human interaction required. Running the program without command-line arguments runs the visual part of the program so that a researcher can analyze the latest data set.
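
As a rough sketch of this hand-off, the batch file only needs to contain one command line. The Python below is illustrative and is not the FileMaker script used in the lab: the executable name, flag names, and paths are all assumptions.

```python
def quote(arg: str) -> str:
    """Wrap an argument in double quotes if it contains a space."""
    return f'"{arg}"' if " " in arg else arg


def batch_file_text(exe, pathway, output, config, experiments):
    """Return the text of a batch file that runs one merge command."""
    # Flag names are hypothetical; they stand in for the tool's real arguments.
    parts = [exe, "--pathway", pathway, "--output", output, "--config", config]
    parts.extend(experiments)
    command = " ".join(quote(p) for p in parts)
    return "@echo off\r\n" + command + "\r\n"


def run_mode(argv):
    """No arguments: open the visualization. Otherwise: merge data and exit."""
    return "gui" if not argv else "merge"


# Made-up executable name and paths, for illustration only.
text = batch_file_text(
    "pathway_tool.exe", "pathway.xml", "merged.xml", "tool.cfg",
    ["C:\\lab data\\latest.xml"],
)
print(text)
```

Quoting paths that contain spaces matters here because nobody inspects the generated file before it runs.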

In order to encourage researchers to use the protein pathway exploration tool I conducted a series of one-on-one demonstrations and interviews with researchers from the proteomics lab. During the demonstrations I highlighted software features of which lab members were not aware and clarified questions regarding the software controls. I recorded immediate feedback from researchers by conducting interviews in order to evaluate the effects of the integration and the weaknesses of the current version of the tool.

Results

The main result of my work is that researchers now plan to use the protein pathway exploration tool to augment their current methods of data analysis, whereas before the large majority of researchers working in the Salomon Proteomics Lab had seldom used the software. With the exception of Arthur Salomon, no researcher interviewed had used the protein pathway exploration tool more than a few times prior to the integration, and none had used it for the purpose of conducting research. After an explanation of the integration and one-on-one demonstrations, every researcher interviewed expressed a willingness to try the integrated software and identified at least one feature of the tool that would be beneficial for his or her own research. Because the tool uses only a single source to provide protein interaction information, and lacks certain filtering capabilities, none of the researchers believed it would completely replace their current methods of data analysis. However, many researchers feel that it has the potential to save time when generating hypotheses from experimental data.

The integration also enhances the proteomics lab's novel automated infrastructure by connecting it to a potentially powerful analytical tool. Prior to the integration, the Salomon Proteomics Lab developed a novel automated proteomic software platform to collect, process, and store proteomic datasets[4][5]. The system also performs some post-acquisition tasks; however, prior to the current work, researchers could only view experimental data in the form of a list with associated heat maps within a FileMaker interface. Thus, determining associations between proteins found in experiments and those in the literature was a manual, months-long process of querying online databases. The integration of the protein pathway exploration tool finally makes it possible to view experimental data in the context of known protein interactions. Visualizing and filtering these interactions allows proteomics researchers to make sense of high-throughput phosphorylation experiments. According to the researchers interviewed, the current tool, with the proper filtering capabilities, could become integral to this type of data analysis.

By directly engaging with researchers, documenting their concerns, and making available this tool, my work also establishes a feedback loop with this proteomics lab which can help developers improve the tool to better meet the needs of all proteomics researchers. One-on-one software demonstrations and guided use of the latest version of the tool gave researchers an opportunity to provide immediate user feedback on the software, and helped them become familiar with the tool's controls more quickly so they can be more productive immediately when using the tool. This initial feedback allows for an opportunity to be receptive to proteomics researchers' needs, which can, in turn, establish a user base for the tool. Such a user base is a valuable source for ongoing feedback, which can motivate further development of the tool.

Discussion

In this section I will begin by describing some of the issues involved with the integration and the decisions I made. I will proceed by discussing the merits of the protein pathway exploration tool, and then detail researchers' and my own feedback. I will conclude by reflecting on my experience working with the Scientific Visualization group, and my thoughts for the improvement of its development mechanisms and infrastructure.

Issues with Integration

The time frame for completing the integration, and the scope of the integration's evaluation had to be scaled down due to time constraints and unforeseen complications. I only had a single semester to plan, complete, and evaluate the integration. Thus, the initial time frame for completing the integration called for the completion and setup of the software after the eighth week of the semester so that there would be sufficient time to complete an evaluation and perform the second part of the integration which involved engaging with researchers. Unfortunately the code base for the tool was not readily available from source control, the only known way to build the code was by using a specific version of Microsoft Visual Studio, and building required linking in two external libraries. These, and other complications that I will discuss below, delayed the completion of the software part of the integration by at least five weeks. This forced me to scale back the evaluation of the integration. Instead of attempting to use the protein pathway exploration tool to conduct my own research as initially planned, or conducting a comparative study to determine the speed gains associated with using the tool in contrast to manual methods in which I would time researchers, I settled for conducting interviews in which researchers had to estimate and discuss in more theoretical terms the merits of the integration. Additionally, the second part of the integration, which was intended to garner feedback and establish strong contacts among proteomics researchers, became a series of one-on-one software demonstrations with little follow up. Overall, I see my work as an important first step towards producing an integral proteomics analysis tool. Follow up on the part of both developers and users is necessary for the tool's overall success.

Protein Pathway Exploration Tool

The protein pathway exploration tool has the potential to become an integral part of proteomics analysis; however, in its current state, it does not replace current analytical methods. In his interview, Arthur Salomon expressed that the ideas behind the tool and its current implementation have great value, and mentioned anecdotally the excitement the tool has incited among proteomicists who have seen him demonstrate it. However, researchers in the Salomon Proteomics Lab have pointed out several flaws with the current version of the program, which I detail below together with my own thoughts regarding the tool. I have ordered this critical feedback so that the issues most important to address for immediate usefulness come first, and higher-level issues that would be valuable to address in the long run come last.

The current version of the tool does not have adequate filtering capabilities. The protein pathway exploration tool visualizes an immense amount of data, which is visually complex and confusing. The exploration plane feature addresses this somewhat, but some proteins have over 100 interactions, so even the exploration plane view can be overwhelming. Filtering capabilities address this issue; however, the current filters fall short of what proteomicists need.

There is a need for more filters based on meta data that is currently available in HPRD. Currently the tool displays all proteins that interact with a given protein without regard for the type of cell in which the protein is found. Researchers who are studying proteins found only in liver cells, for example, currently cannot view only the proteins found in liver cells that interact with their proteins of interest. This meta information is currently available in HPRD as "Site of Expression". HPRD also has meta data detailing the "Molecular Class", "Molecular Function", and "Biological Process" of each protein. Being able to filter on all of this meta data would be beneficial to researchers.
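
A filter of this kind could look roughly like the sketch below. It is written in Python purely for illustration and assumes the HPRD annotations have already been loaded into a dictionary. The protein names, the annotation values, and the function name are made up.

```python
def filter_by_metadata(interactors, metadata, field, wanted):
    """Keep interactors whose `field` annotation includes `wanted`.

    Each annotation is assumed to be stored as a list of values; proteins
    with no annotation for `field` are filtered out.
    """
    return [
        name for name in interactors
        if wanted in metadata.get(name, {}).get(field, [])
    ]


# Made-up annotations for three placeholder proteins.
metadata = {
    "P1": {"Site of Expression": ["Liver", "Kidney"]},
    "P2": {"Site of Expression": ["Brain"]},
    "P3": {},  # no annotation available
}

liver_only = filter_by_metadata(
    ["P1", "P2", "P3"], metadata, "Site of Expression", "Liver"
)
print(liver_only)  # ['P1']
```

The same function would cover "Molecular Class", "Molecular Function", and "Biological Process" if those fields were stored in the same way.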

The ability to filter the experimental data and have it reflected in the visualization would be very beneficial. Proteomicists in the Salomon Lab currently spend time looking at their experimental data in the form of lists of heat maps to identify proteins with similar changes in expression. Researchers may not know which proteins are interesting to them until after this time-consuming process, and therefore they would not use the protein pathway exploration tool until after this time. If the ability to filter experimental proteins based on their associated heat maps were incorporated into the protein pathway exploration tool, this could encourage them to use the tool for the first step of data analysis.
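
One possible form of such a filter is sketched below in Python. It treats each row of the heat map as a list of numbers and keeps the proteins whose rows lie close to a chosen reference protein. Euclidean distance is my own choice of similarity measure for the example; researchers did not specify one, and the values are made up.

```python
import math


def similar_profiles(profiles, reference, max_distance):
    """Return proteins whose expression profile is close to the reference.

    `profiles` maps a protein name to its heat-map row (one number per
    condition or time point); all rows must have the same length.
    """
    ref = profiles[reference]
    return sorted(
        name for name, values in profiles.items()
        if name != reference and math.dist(ref, values) <= max_distance
    )


# Made-up heat-map rows for three placeholder proteins.
profiles = {
    "P1": [0.0, 1.0, 2.0],
    "P2": [0.1, 1.1, 1.9],    # changes much like P1
    "P3": [0.0, -1.0, -2.0],  # changes in the opposite direction
}

print(similar_profiles(profiles, "P1", max_distance=0.5))  # ['P2']
```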

Creating a plug-in architecture for filters, or a macro language in which to specify filters, would allow the tool to meet the needs of a diversity of proteomics research workflows. Currently every proteomics lab has slightly different data and focuses on different parts of the data. To better meet the needs of a wide range of proteomics labs, it is important to incorporate some flexibility into the tool. Furthermore, if there were a simple way to add filtering capabilities to the tool, proteomics researchers could write their own filters that best meet their own needs and thereby increase the usefulness of the tool.
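
To make the idea concrete, a minimal plug-in mechanism can be as small as a registry of named filter functions. The Python sketch below is one possible design, not a description of the tool. Every name in it, and the sample data, is hypothetical.

```python
from typing import Callable, Dict, List

# A filter takes one protein record and decides whether to keep it.
FilterFn = Callable[[dict], bool]
FILTERS: Dict[str, FilterFn] = {}


def register_filter(name: str):
    """Decorator that makes a filter available to the tool under `name`."""
    def decorator(fn: FilterFn) -> FilterFn:
        FILTERS[name] = fn
        return fn
    return decorator


# A lab could ship its own filters without touching the tool's core code.
@register_filter("kinases_only")
def kinases_only(protein: dict) -> bool:
    return protein.get("molecular_class") == "Kinase"


@register_filter("has_experimental_data")
def has_experimental_data(protein: dict) -> bool:
    return bool(protein.get("measurements"))


def apply_filters(proteins: List[dict], names: List[str]) -> List[dict]:
    """Keep the proteins that pass every selected filter."""
    return [p for p in proteins if all(FILTERS[n](p) for n in names)]


# Made-up records for illustration.
proteins = [
    {"id": "P1", "molecular_class": "Kinase", "measurements": [1.2]},
    {"id": "P2", "molecular_class": "Kinase", "measurements": []},
    {"id": "P3", "molecular_class": "Adapter", "measurements": [0.4]},
]
kept = apply_filters(proteins, ["kinases_only", "has_experimental_data"])
print([p["id"] for p in kept])  # ['P1']
```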

The current version of the tool does not incorporate all available meta data seamlessly into the interface. If a user selects a protein in the current version of the tool, the only meta data that is displayed is the protein's long name, the number of interactions it has, and a link to its HPRD page. Researchers have mentioned that displaying a list of the proteins with which the given protein interacts would be much more helpful than just the number of interactions (this could be addressed by integrating different parts of the GUI, as I will discuss below). Since the HPRD site is often down, linking to a protein's HPRD page is often useless; however, HPRD contains important meta data. Researchers expressed a desire to see this meta data cached and then displayed in a graceful way (e.g. in well-designed pop-up boxes on mouse-overs).
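
Caching could be as simple as keeping fetched records in a local file, so that the tool still has the meta data when the HPRD site is down. The Python sketch below assumes the caller supplies a fetch function. How a record is actually retrieved from HPRD is deliberately left out, and the class and file format are illustrative only.

```python
import json
from pathlib import Path


class MetadataCache:
    """Keep fetched protein meta data in a local JSON file."""

    def __init__(self, path, fetch):
        self.path = Path(path)
        self.fetch = fetch  # caller-supplied function: protein ID -> dict
        # Reuse records saved by an earlier session, if any.
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, protein_id):
        """Return cached meta data, fetching and saving it on first use."""
        if protein_id not in self.data:
            self.data[protein_id] = self.fetch(protein_id)
            self.path.write_text(json.dumps(self.data))
        return self.data[protein_id]
```

With a cache like this, a mouse-over pop-up would read from the local file and contact the remote site only the first time a protein is viewed.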

Some parts of the current version of the tool's GUI are not well integrated, and this makes doing certain tasks difficult. Currently on the left side of the tool's GUI in the "Analyze" tab, there is a Qt Toolbox widget. The Toolbox widget contains a "Protein" page, an "Interaction" page, a "Filters and Selectors" page, and a "Visuals" page. The "Interaction" page is not currently integrated with the visualization, which makes it somewhat useless. When a user clicks on a protein in the visualization and then clicks to view the "Interaction" page, nothing is selected, and therefore no interacting proteins are displayed. When a user clicks on a protein in the list in the "Interaction" page no protein is highlighted in the visualization. Similarly, when a protein is selected from the list in the "Protein" page, which selects the protein in the visualization, nothing is selected in the list in the "Interaction" page. Also, unlike in the "Protein" page, double-clicking on a protein in the list in the "Interaction" page does not bring up the exploration plane. This makes the "Interaction" page somewhat unusable.

Sometimes the tool's GUI makes doing tasks difficult. To fix one prominent example of this, which I detail shortly, the interface should continue to show whatever page a user selected last until the user selects another page. Currently, whenever a user clicks anywhere in the visualization (on a protein or not), the page the user was viewing closes, and the "Protein" page is displayed. This forces the user to click back to the page of interest every time the mouse is clicked anywhere in the visualization. Another similarly frustrating problem is that when the mouse is right-clicked and dragged in the visualization, which is used to pan the view, the page the user was viewing closes, and the "Protein" page is displayed. An additional annoyance is that whatever protein happens to be right-clicked when a user is panning becomes selected and the previous selection is lost. These interface behaviors completely disrupt what the user is doing. They can be very frustrating for users in many cases, such as when a user is selecting proteins to save as a stored selection, when the user is adjusting the visual options for the visualization, or when the user is adjusting filters.

Integrating different parts of the GUI as explained above could provide researchers with a list of proteins that interact with a given protein immediately after they click on the given protein, which is a feature researchers require for the tool to be useful for them. If the interface were designed to display the page that a user is viewing until the user explicitly chooses to view another page, and the "Interaction" page were integrated with the visualization and the list of proteins in the "Protein" page, this would be a solved issue. Supposing these fixes are in place, the "Interaction" page would continue to be displayed while a user selects proteins in the visualization and the user would be able to immediately see the list of proteins that interact with the selected protein, because selected proteins in the visualization would be selected in the list in the "Interaction" page. By contrast, if a user of the current version of the tool is interested in seeing a list of interactions for several proteins found in the display, the user must click on each protein, note the protein's ID, scroll through the list of proteins in the "Interaction" panel to find the given ID, and then click on it. This clearly lowers the tool's usability.
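
One common way to keep several views in step is to route every selection through a single shared model that notifies each view. The Python sketch below illustrates that idea only. It is not the tool's Qt code, and the class names and data are hypothetical.

```python
class SelectionModel:
    """Single source of truth for the currently selected protein."""

    def __init__(self):
        self.selected = None
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def select(self, protein_id):
        if protein_id == self.selected:
            return  # nothing changed, so do not notify anyone
        self.selected = protein_id
        for callback in self._listeners:
            callback(protein_id)


class InteractionPage:
    """Stand-in for the "Interaction" page: lists the selection's partners."""

    def __init__(self, model, interactions):
        self.interactions = interactions
        self.visible = []
        model.subscribe(self.on_selection)

    def on_selection(self, protein_id):
        self.visible = sorted(self.interactions.get(protein_id, ()))


# Made-up interaction data for illustration.
model = SelectionModel()
page = InteractionPage(model, {"P1": {"P3", "P2"}, "P2": {"P1"}})

model.select("P1")   # e.g. triggered by a click in the visualization
print(page.visible)  # ['P2', 'P3']
```

Because the visualization and both list pages would all call select and all subscribe, a click in any one of them updates the others without the user changing pages.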

The current version of the tool does not have protein interaction information for other organisms besides humans. The tool does not have relevance for proteomicists who are studying proteins in any organism that is not a human, because the tool only uses protein interaction information from HPRD. HPRD documents only proteins found in humans. To address this problem, the tool must integrate information from other sources besides HPRD, as I will discuss below.

The current version of the tool does not incorporate information from other sources besides HPRD. During the phase of research that calls for analyzing the experimental phosphorylation data, researchers rely on submitting manual queries to online databases and tools such as HPRD, Search Tool for the Retrieval of Interacting Genes/Proteins (STRING)[6], PubMed, SciFinder, Scansite, and Google Scholar. These databases contain different meta data for proteins found in different organisms. Some of these are search engines for scholarly literature. Tools such as Scansite are essentially motif-finders that search for sites on proteins likely to be phosphorylated by particular kinases. For the protein pathway exploration tool to be relevant to all proteomicists there must be data about proteins that are found in organisms other than humans, such as proteins found in STRING. Perhaps for a given interaction, search-engine results from a query composed of the two interacting proteins' names along with specified filtering information such as the site of expression could be integrated gracefully, and provide researchers with some of the extra information they currently need to find manually. Overall, until the tool incorporates information from many of these other sources, it will not replace manual methods for data analysis.
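
At the data level, combining interaction sources amounts to merging edge lists while remembering where each edge came from. The Python sketch below shows one way to do this. The interaction pairs are made up, and the parsing of real HPRD or STRING downloads is not shown.

```python
from collections import defaultdict


def merge_interactions(sources):
    """Merge interaction lists from several databases.

    `sources` maps a database name to an iterable of (protein, protein)
    pairs. Interactions are treated as undirected, so (a, b) and (b, a)
    become the same edge. The result maps each edge to the set of
    databases that report it.
    """
    merged = defaultdict(set)
    for source, pairs in sources.items():
        for a, b in pairs:
            merged[frozenset((a, b))].add(source)
    return dict(merged)


# Made-up interaction pairs for illustration.
merged = merge_interactions({
    "HPRD": [("P1", "P2"), ("P2", "P3")],
    "STRING": [("P2", "P1"), ("P4", "P5")],
})

for edge, reported_by in merged.items():
    print(sorted(edge), sorted(reported_by))
```

Keeping the source names alongside each edge would let a researcher filter on provenance, for example to show only interactions reported by more than one database.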

The current version of the tool is too protein-centric. Currently the protein pathway exploration tool visualizes the connections between proteins without making any distinctions about which part of the protein is actually interacting. This is a major limitation since in proteomics, researchers are often most interested in a particular domain or phosphorylation site of a protein. Additionally, it is often the case that only specific sites on a given protein interact with specific domains on another protein. This level of detail is currently not incorporated in the tool's visualization. Gracefully adding views of protein interactions at the domain and phosphorylation-site level to the visualization would make the tool even more valuable to proteomicists.

The current version of the tool's GUI needs to be simplified for greater ease of use. Some researchers reported that because of the current GUI design's complexity, the learning curve for the tool is high, and recommended attempting to simplify it. Additionally some GUI elements are used in unintuitive ways. For example, when filtering on protein annotations, the tool provides a slider to select the annotation. Sliders are not generally used for selecting from a collection of discrete choices for which in-between options have no meaning, and so this is confusing for users. Instead, a GUI element like a drop down menu would be more appropriate and intuitive.

The focus of the tool should be on what researchers currently cannot do with HPRD and other online tools alone. As a general comment about the tool, some researchers noted that to make the protein pathway exploration tool more integral to proteomics analysis, it needs features that go beyond simple filtering and visualization and provide capabilities that researchers cannot achieve by hand. Researchers did not offer specifics.

Reflections

As a new member of the Scientific Visualization group this semester, I experienced many complications that illustrate flaws in the group's development methods and infrastructure. To address these issues, I will describe important mechanisms that I believe the group should consider establishing in a larger role than they currently play.

It is vital to have every project in the group under source control so that new developers on a project can get the latest code immediately after a simple checkout command. As mentioned before, the protein pathway exploration tool's code base was not under source control. Therefore the code was not immediately available, and, instead, the latest version of the code resided on a single personal laptop computer. This is a direct reflection of the confusing source control system that was in place when I joined the group. Some group members with whom I worked directly could neither explain to me where $G fit in with CVS nor say how the group's CVS setup worked, and so it is not astonishing that the code base on which I was going to work was not under source control. As a result, I was without the code for about three weeks, which would have been acceptable if not for additional complications building the software, which I discuss next.

It is important to have a simple build system for every project so that anyone who checks out code can run a "make" command and get the code built immediately. To this end developers should consider minimizing external dependencies in projects, and the group's infrastructure should provide prebuilt common libraries to link to. As mentioned previously, the only way to build the protein pathway exploration tool's code base was by using a specific version of Microsoft Visual Studio, and I had to perform a complete build of an external library several times. In my case, the build process was so complicated and fragile that it took me just under five weeks before I was able to get the build process working, and this was after Radu Jianu rebuilt the code using the latest versions of the external libraries and Visual Studio to make sure we were using identical environments. This large delay forced me to completely rework my time frame for the integration.

The group might consider requiring or providing suggestions for comprehensive project testing plans, such as those involving unit testing. In my experience this semester there was no obvious way to test the output files I was generating, and most testing consisted of "eyeballing" visualizations to make sure they looked acceptable. Instead, if the code base I was working with had built-in unit tests, I would have been able to make sure I was calling functions with appropriate inputs from the moment I started developing. In my case, I had to write most of my code first so that I could look at the generated output to evaluate whether my code was behaving correctly.
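
As an illustration of what such a test could look like, the Python sketch below checks a generated pathway file programmatically instead of by eye. The merge_experiment function is a simplified stand-in written for this example, and the XML element and attribute names are assumptions rather than the tool's real file format.

```python
import unittest
import xml.etree.ElementTree as ET


def merge_experiment(pathway_xml: str, measurements: dict) -> str:
    """Stand-in for the merge step: attach a ratio to each measured protein."""
    root = ET.fromstring(pathway_xml)
    for protein in root.iter("protein"):
        ratio = measurements.get(protein.get("id"))
        if ratio is not None:
            protein.set("ratio", str(ratio))
    return ET.tostring(root, encoding="unicode")


class MergeExperimentTest(unittest.TestCase):
    # Hypothetical, minimal pathway file with two proteins.
    PATHWAY = '<pathway><protein id="P1"/><protein id="P2"/></pathway>'

    def test_measured_protein_gets_ratio(self):
        root = ET.fromstring(merge_experiment(self.PATHWAY, {"P1": 2.5}))
        ratios = {p.get("id"): p.get("ratio") for p in root.iter("protein")}
        self.assertEqual(ratios, {"P1": "2.5", "P2": None})

    def test_unknown_protein_is_ignored(self):
        root = ET.fromstring(merge_experiment(self.PATHWAY, {"P9": 1.0}))
        self.assertEqual(len(list(root.iter("protein"))), 2)


# Run the two tests and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MergeExperimentTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests of this size can be written before the code they check, which would have let me verify inputs and outputs from the first day of development.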

It is important for code to be periodically refactored if it is going to be extensible and maintainable, especially for projects developed in an iterative fashion. The protein pathway exploration tool was developed using a form of the spiral model of software development, which involves multiple iterations of prototyping, testing, and feedback loops. The result of such development without refactoring is an unwieldy code base that is difficult to understand and develop. As testament to this fact, some of the last major bug fixes of the software required calls to functions of which I was completely unaware, even though I had been spending weeks with the code. The group might consider encouraging this practice even though it might not seem as efficient in the short-term, because otherwise group members' code essentially dies after the initial developer stops developing it.

Conclusion

The integration of the protein pathway exploration tool into a proteomics lab's infrastructure provides a new way of analyzing proteomic data for the lab, and establishes a small user base for the tool. This is the first step toward developing the tool into software that plays a central role in proteomic data analysis. Before this can be accomplished, however, the code base needs to be refactored and a test plan established. Developers of the tool might even consider open sourcing the tool so that a larger community of developers could make the tool into something larger than what meets the needs of any single proteomics lab.

After working with the Scientific Visualization group and enduring complications and delays of schedule, I believe that there is a need for the group to evaluate the way it develops software, and the infrastructure it provides for such development. Specifically, it would be beneficial for the group to establish core development principles that compromise between wanting code to be maintainable and wanting code to be developed quickly for inclusion in papers. Furthermore, building an infrastructure that incorporates a straightforward version control system, simple building mechanisms, and testing plans would go far in providing developers with the tools they need to make more maintainable and longer-lasting code.

Notes

  1. Jianu R., Laidlaw D.H., Salomon A. Viewing Phosphorylation Experiments Data in the Context of Known Protein Interactions. (manuscript)
  2. Peri S., Navarro J.D., Amanchy R., Kristiansen T.Z., Jonnalagadda C.K., Surendranath V., Niranjan V., Muthusamy B., Gandhi T.K.B., Gronborg M., Ibarrola N., Deshpande N., Shanker K., Shivashankar H.N., Rashmi B.P., Ramya M.A., Zhao Z., Chandrika K.N., Padma N., Harsha H.C., Yatish A.J., Kavitha M.P., Menezes M., Choudhury D.R., Suresh S., Ghosh N., Saravana R., Chandran S., Krishna S., Joy M., Anand S.K., Madavan V., Joseph A., Wong G.W., Schiemann W.P., Constantinescu S.N., Huang L., Khosravi-Far R., Steen H., Tewari M., Ghaffari S., Blobe G.C., Dang C.V., Garcia J.G.N., Pevsner J., Jensen O.N., Roepstorff P., Deshpande K.S., Chinnaiyan A.M., Hamosh A., Chakravarti A., and Pandey A. Development of Human Protein Reference Database as an Initial Platform for Approaching Systems Biology in Humans. Genome Res. 13, 2363-2371, 2003.
  3. Jianu R., Park R., Salomon A., Laidlaw D.H. Viewing proteomic experiments in context with known protein networks. (manuscript)
  4. Yu K., Salomon A. SIPA: System for Integrated Proteomic Analysis. (manuscript)
  5. Yu K., Salomon A. Flexible Relational Database for Visualization of Quantitative Proteomic Data and Integration of Existing Protein Information. (manuscript)
  6. von Mering C., Huynen M., Jaeggi D., Schmidt S., Bork P., and Snel B. STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 31, 258-261, 2003.

References

  • Yu K. SVAT: A Computer assisted MS/MS analysis tool for spectral visualization. (manuscript)